
[REQUEST] dynamic batch size with gradient accumulate #6533

Open · Xiang-cd opened this issue Sep 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Xiang-cd

Is your feature request related to a problem? Please describe.
My minimal use case: I have two datasets with resolutions 256 and 512, and I want to build two dataloaders, one loading 256x256 images with batch size 8 and the other loading 512x512 images with batch size 2.
This conflicts with the note in the documentation:

Note: train_batch_size must be equal to train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed.

So how should train_micro_batch_size_per_gpu be decided here?
This leads to a more fundamental question: how does DeepSpeed handle gradient accumulation?

  1. By counting the model's forward passes: no matter what the batch size is, gradient accumulation remains logically correct.
  2. By counting instances, i.e. the number of data samples that have gone through the model, e.g. performing the optimizer step once 32 instances have been processed. In that case the number of forward passes differs between 512x512 and 256x256 data, which causes a logical problem with mixed-batch-size training (see the sketch after this list).
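
A rough sketch of the constraint conflict, assuming a single GPU and a hypothetical gradient_accumulation_steps of 4 (the numbers are illustrative, not from a real config):

```python
# DeepSpeed's documented constraint:
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps
#                       * number_of_gpus
NUM_GPUS = 1
GRAD_ACCUM_STEPS = 4   # hypothetical value
MICRO_BATCH_256 = 8    # 256x256 dataloader
MICRO_BATCH_512 = 2    # 512x512 dataloader

# With two different micro batch sizes there is no single
# train_micro_batch_size_per_gpu / train_batch_size pair that
# satisfies the equation for both dataloaders:
print(MICRO_BATCH_256 * GRAD_ACCUM_STEPS * NUM_GPUS)  # 32
print(MICRO_BATCH_512 * GRAD_ACCUM_STEPS * NUM_GPUS)  # 8
```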

Describe the solution you'd like

  1. Document how DeepSpeed performs gradient accumulation.
  2. Preferably, count the model's forward passes to decide when a gradient accumulation boundary is reached.
  3. Lift the restriction that train_batch_size must equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs.

Thank you for your great work.

@Xiang-cd Xiang-cd added the enhancement New feature or request label Sep 13, 2024
@Xiang-cd
Author

hi, is there anyone?

@tjruwase
Contributor

@Xiang-cd, gradient accumulation in DeepSpeed works as follows:

  1. Assume each training iteration consists of fwd, bwd, step.
  2. The micro-step counter is incremented in step, and the configured gradient_accumulation_steps value is used to detect the global step boundary.

No core DeepSpeed functionality needs batch size information, so the restriction tying batch size, gradient accumulation steps, and GPU count together can be relaxed/eliminated.
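
A minimal sketch of this step-counting pattern, with a toy model, toy dataset, and illustrative config values that are assumptions rather than part of this issue:

```python
import torch
import deepspeed

# Toy model and dataset (hypothetical, for illustration only).
model = torch.nn.Linear(16, 1)
dataset = [(torch.randn(16), torch.randn(1)) for _ in range(64)]

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,  # train_batch_size = 8 * 4 * num_gpus
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

for x, y in dataloader:
    x, y = x.to(engine.device), y.to(engine.device)
    loss = torch.nn.functional.mse_loss(engine(x), y)  # fwd
    engine.backward(loss)                              # bwd: gradients accumulate
    engine.step()  # step: increments the micro-step counter; the optimizer only
                   # runs on every gradient_accumulation_steps-th micro-step
```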

Can you share more details of your dynamic batch size scenario? For example, is it similar to curriculum learning, which scales batch size and sequence length dynamically and works with DeepSpeed?

Can you share a repro for the error you are seeing? In particular, it would be good to see your ds_config and deepspeed.initialize call. There might be a simple workaround similar to the HF integration.
