stas00 changed the title from "[REQUEST] parallelize zero_to_fp32.py to use multiple cores" to "[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads" on Sep 11, 2024.
When https://github.com/microsoft/DeepSpeed/blob/c27483933d50a693fef9c48418d2664cf6a6a6f8/deepspeed/utils/zero_to_fp32.py was written 3 years ago, models were small and converted quickly. Now, with 70B+ models, the conversion can take hours.
The original script uses a single CPU core.
Here is a possible implementation algorithm:
The way I was thinking multiple cores could be utilized is by loading all shards into CPU memory and then firing off multiple threads, each re-composing a single layer. The user could specify how many cores to use, or by default all cores would be used, so that `n_threads == cores`. I think the total memory usage here will still be `2x model size * dtype`, just like in the original script.
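A minimal sketch of what that could look like, assuming hypothetical `load_all_shards()`, `list_layers()` and `recompose_layer()` helpers that wrap the existing unsharding logic (none of these exist in zero_to_fp32.py today):

```python
# Sketch only: load_all_shards(), list_layers() and recompose_layer() are
# hypothetical helpers standing in for the existing unsharding logic.
import os
from concurrent.futures import ThreadPoolExecutor

def unshard_parallel(checkpoint_dir, n_threads=None):
    n_threads = n_threads or os.cpu_count()   # default: one thread per core
    shards = load_all_shards(checkpoint_dir)  # all shards resident in CPU memory (~1x model size)
    layer_names = list_layers(shards)

    state_dict = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # each worker re-composes the full fp32 tensors of a single layer
        results = pool.map(lambda name: recompose_layer(shards, name), layer_names)
        for name, tensors in zip(layer_names, results):
            state_dict.update(tensors)
    return state_dict
```

Most of the per-layer work is tensor concatenation/copying inside PyTorch's C++ code, which releases the GIL, so plain threads should already help; if it turns out to be GIL-bound, the same structure works with a `ProcessPoolExecutor` at the cost of some extra copying between processes.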
Possible additional changes:
Using `safetensors` would be a bonus because then each tensor could be written separately and there is no need to wait for the whole model to be unsharded to write a single torch tensor. This could also become an option for low-RAM nodes, where each layer is unsharded sequentially and total memory usage will be `1x model size * dtype + max layer size * dtype`, which for a large model would be a huge memory saving, at the cost of not parallelizing - or perhaps using just 1-2 threads, which would already speed things up (see the sketch below).

Switching to the universal checkpoint API would be another bonus because the original is very clunky and very difficult to understand/maintain.

cc: @tjruwase
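For the safetensors idea, here is a rough sketch of the low-RAM sequential mode, using the same hypothetical helpers as above. Note that safetensors writes one whole dict per file, so instead of appending tensors to a single file, each layer would go to its own shard plus an HF-style index:

```python
# Sketch only: list_layers() and recompose_layer() are the same hypothetical
# helpers as above; safetensors.torch.save_file() is the real safetensors API.
import json
import os
from safetensors.torch import save_file

def unshard_low_ram(shards, out_dir):
    """Unshard one layer at a time and flush it to disk, so peak memory stays
    around 1x model size (the shards) + the largest single layer."""
    os.makedirs(out_dir, exist_ok=True)
    weight_map = {}
    for i, name in enumerate(list_layers(shards)):
        tensors = recompose_layer(shards, name)          # fp32 tensors for this layer only
        fname = f"model-{i:05d}.safetensors"
        save_file(tensors, os.path.join(out_dir, fname))
        weight_map.update({key: fname for key in tensors})
        del tensors                                      # free the layer before the next one

    # HF-style index so the sharded safetensors output loads as one checkpoint
    with open(os.path.join(out_dir, "model.safetensors.index.json"), "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f)
```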