
[REQUEST] dynamic batch size with gradient accumulate #6533

Open · Xiang-cd opened this issue Sep 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Xiang-cd

Is your feature request related to a problem? Please describe.
My minimal use case: I have two datasets with resolutions 256 and 512, and I want to build two dataloaders, one loading 256x256 images with batch size 8 and the other loading 512x512 images with batch size 2.
This conflicts with the note in the documentation:

Note: train_batch_size must be equal to train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed.

So how should train_micro_batch_size_per_gpu be decided here?
This leads to a more fundamental question: how does DeepSpeed handle gradient accumulation?

  1. By counting the model's forward passes: no matter what the batch size is, gradient accumulation remains logically correct.
  2. By counting instances, i.e. the number of data samples that have gone through the model, e.g. performing the optimizer step once 32 instances have been processed. In that case the number of forward passes differs between 512x512 and 256x256 data, which causes a logical problem with mixed-batch-size training (see the sketch after this list).
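
A rough sketch of the constraint conflict, assuming a single GPU and a hypothetical gradient_accumulation_steps of 4 (the numbers are illustrative, not from a real config):

```python
# DeepSpeed's documented constraint:
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps
#                       * number_of_gpus
NUM_GPUS = 1
GRAD_ACCUM_STEPS = 4   # hypothetical value
MICRO_BATCH_256 = 8    # 256x256 dataloader
MICRO_BATCH_512 = 2    # 512x512 dataloader

# With two different micro batch sizes there is no single
# train_micro_batch_size_per_gpu / train_batch_size pair that
# satisfies the equation for both dataloaders:
print(MICRO_BATCH_256 * GRAD_ACCUM_STEPS * NUM_GPUS)  # 32
print(MICRO_BATCH_512 * GRAD_ACCUM_STEPS * NUM_GPUS)  # 8
```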

Describe the solution you'd like

  1. Document how DeepSpeed performs gradient accumulation.
  2. Preferably, count the model's forward passes to decide when a gradient accumulation boundary is reached.
  3. Lift the restriction that train_batch_size must equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs.

Thank you for your great work.

@Xiang-cd Xiang-cd added the enhancement New feature or request label Sep 13, 2024
@Xiang-cd
Author

hi, is there anyone?

@tjruwase
Contributor

@Xiang-cd, gradient accumulation in DeepSpeed works as follows:

  1. Assume each training iteration consists of fwd, bwd, step.
  2. The micro-step counter is incremented in step, and the configured gradient_accumulation_steps value is used to detect the global step boundary.

No core DeepSpeed functionality needs batch size information, so the restriction tying batch size, gradient accumulation steps, and GPU count together can be relaxed/eliminated.
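
A minimal sketch of this step-counting pattern, with a toy model, toy dataset, and illustrative config values that are assumptions rather than part of this issue:

```python
import torch
import deepspeed

# Toy model and dataset (hypothetical, for illustration only).
model = torch.nn.Linear(16, 1)
dataset = [(torch.randn(16), torch.randn(1)) for _ in range(64)]

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,  # train_batch_size = 8 * 4 * num_gpus
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

for x, y in dataloader:
    x, y = x.to(engine.device), y.to(engine.device)
    loss = torch.nn.functional.mse_loss(engine(x), y)  # fwd
    engine.backward(loss)                              # bwd: gradients accumulate
    engine.step()  # step: increments the micro-step counter; the optimizer only
                   # runs on every gradient_accumulation_steps-th micro-step
```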

Can you share more details of your dynamic batch size scenario? For example, is it similar to curriculum learning, which scales batch size and sequence length dynamically and works with DeepSpeed?

Can you share a repro for the error you are seeing? In particular, it would be good to see your ds_config and deepspeed.initialize call. There might be a simple workaround similar to the HF integration.
