Resolution bucketing and Trainer implementation refactoring #11117
base: master
Conversation
I don't know where to ask this, but does your LoRA trainer support every model that ComfyUI supports? Also, could you please include optimizers from this project: https://pytorch-optimizers.readthedocs.io/en/latest/? (Some of the optimizers there greatly reduce VRAM consumption.)
Thank you for the answers. I have some more questions/requests for LoRA training in ComfyUI.

For 1: The video models do support image training as well, though that's almost useless since video models are mostly used for animations. If training with videos gets implemented, I would even consider dropping musubi.

For 2: Some optimizers are "schedule free" and some don't require any hyperparameters, so having access to extra optimizers would be very nice.

As for the new requests, there are some training optimizations that greatly improve LoRA training speed, both in iteration speed and convergence time:

The relatively simple to implement:

Slightly harder to implement:

There are some more mentioned here: https://x.com/SwayStar123/status/1994673352754270318

Even just TREAD, which seems to be the easiest to implement and has been tested for LoRA training, would greatly improve convergence and iteration speed.
Please open an issue/discussion or a PR for your request; please don't keep posting things unrelated to this PR/thread.
Hi, I tested this PR, here's my feedback:
Will check this issue soon.
@bezo97 I have pushed a fix for the gradient_checkpointing bug you mentioned; it should now be resolved:

In this PR I propose a mechanism for resolution bucketing. Unlike standard Aspect Ratio Bucketing (ARB), it lets the user input latents of arbitrary resolution: we perform bucketing directly on the list of latents and assume they already have the size the user expects.
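To illustrate the idea (this is a hedged sketch, not the PR's actual code; the function name and latent layout are assumptions), bucketing a list of already-sized latents amounts to grouping them by spatial shape so that each batch only contains same-resolution samples:

```python
from collections import defaultdict

def bucket_latents_by_resolution(latents):
    """Group already-sized latent tensors into buckets keyed by their
    spatial dimensions, so each training batch mixes only latents of
    the same resolution. Illustrative sketch only; assumes each latent
    has a .shape of (C, H, W)."""
    buckets = defaultdict(list)
    for latent in latents:
        _, h, w = latent.shape  # key on spatial dims only
        buckets[(h, w)].append(latent)
    return dict(buckets)
```

Since the user provides latents at whatever resolution they want, no resizing happens at this stage; the trainer just has to avoid batching mismatched shapes together.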
(This PR also adds a "ResizeToPixelCount" node, which can mimic the effect of standard ARB.)
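A node like ResizeToPixelCount would presumably compute target dimensions that preserve aspect ratio while hitting a fixed pixel budget, which is the core of what ARB does. The following is a hypothetical sketch of that computation (the function name, rounding rule, and `multiple` parameter are all assumptions, not the node's real implementation):

```python
import math

def resize_dims_to_pixel_count(width, height, target_pixels, multiple=8):
    """Compute a new (width, height) with roughly `target_pixels` total
    pixels while preserving aspect ratio, snapped to a multiple so the
    result maps cleanly into latent space. Hypothetical sketch of what
    a ResizeToPixelCount-style node might compute."""
    scale = math.sqrt(target_pixels / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h
```

For example, a 1920x1080 image with a 1024x1024-pixel budget would map to roughly 1368x768, keeping the 16:9 aspect ratio while matching the total pixel count of a square 1024 image.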
Besides resolution bucketing, this PR also fixes the issue in #10940, where missing data movement caused tensors to end up on the wrong device.
As for the Trainer refactoring, each task (such as creating the adapter, or the training step in different modes) is now split into separate functions to improve maintainability.
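In structural terms (a minimal sketch of the described pattern; the class, method, and mode names here are illustrative, not the PR's actual API), splitting per-mode logic into dedicated methods and dispatching to them keeps each piece independently testable:

```python
class Trainer:
    """Illustrative sketch of a trainer split into one method per task,
    with a small dispatch for mode-specific training steps."""

    def create_adapter(self, model):
        # Attach a LoRA-style adapter to the model (placeholder logic).
        return model

    def training_step(self, batch, mode):
        # Dispatch to a per-mode step function instead of one large
        # branching method, so each mode can evolve independently.
        steps = {"lora": self._lora_step, "full": self._full_step}
        return steps[mode](batch)

    def _lora_step(self, batch):
        # Mode-specific step logic would live here.
        return "lora"

    def _full_step(self, batch):
        return "full"
```

The benefit is purely organizational: each mode's step function can be read, modified, and tested without touching the others.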
We also use a custom TrainGuider with a modified model-loading helper to allow custom control over the loading behavior.
TL;DR: