stas00 changed the title from "[REQUEST] parallelize zero_to_fp32.py to use multiple cores" to "[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads" on Sep 11, 2024.
When https://github.com/microsoft/DeepSpeed/blob/c27483933d50a693fef9c48418d2664cf6a6a6f8/deepspeed/utils/zero_to_fp32.py was written 3 years ago, models were small and converted quickly. Now, with 70B+ models, the conversion can take hours.
The original script uses a single CPU core.
Here is a possible implementation algorithm:
The way I was thinking multiple cores could be utilized is by loading all shards into CPU memory and then firing off multiple threads, each re-composing a single layer. The user could specify how many cores to use, or by default all cores would be used, so that `n_threads == cores`. I think the total memory usage here will still be `2x model size * dtype`, just like in the original script.
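A minimal sketch of what that could look like, assuming hypothetical `load_all_shards()`, `list_layers()` and `recompose_layer()` helpers that wrap the existing unsharding logic (none of these exist in zero_to_fp32.py today):

```python
# Sketch only: load_all_shards(), list_layers() and recompose_layer() are
# hypothetical helpers standing in for the existing unsharding logic.
import os
from concurrent.futures import ThreadPoolExecutor

def unshard_parallel(checkpoint_dir, n_threads=None):
    n_threads = n_threads or os.cpu_count()   # default: one thread per core
    shards = load_all_shards(checkpoint_dir)  # all shards resident in CPU memory (~1x model size)
    layer_names = list_layers(shards)

    state_dict = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # each worker re-composes the full fp32 tensors of a single layer
        results = pool.map(lambda name: recompose_layer(shards, name), layer_names)
        for name, tensors in zip(layer_names, results):
            state_dict.update(tensors)
    return state_dict
```

Most of the per-layer work is tensor concatenation/copying inside PyTorch's C++ code, which releases the GIL, so plain threads should already help; if it turns out to be GIL-bound, the same structure works with a `ProcessPoolExecutor` at the cost of some extra copying between processes.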
Possible additional changes:
Using `safetensors` would be a bonus because then each tensor could be written separately and there is no need to wait for the whole model to be unsharded to write a single torch tensor. This could also become an option for low-RAM nodes, where each layer is unsharded sequentially and total memory usage will be `1x model size * dtype + max layer size * dtype`, which for a large model would be a huge memory saving, at the cost of not parallelizing - or perhaps using just 1-2 threads, which would already speed things up (see the sketch below).

Switching to the universal checkpoint API would be another bonus because the original is very clunky and very difficult to understand/maintain.

cc: @tjruwase
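For the safetensors idea, here is a rough sketch of the low-RAM sequential mode, using the same hypothetical helpers as above. Note that safetensors writes one whole dict per file, so instead of appending tensors to a single file, each layer would go to its own shard plus an HF-style index:

```python
# Sketch only: list_layers() and recompose_layer() are the same hypothetical
# helpers as above; safetensors.torch.save_file() is the real safetensors API.
import json
import os
from safetensors.torch import save_file

def unshard_low_ram(shards, out_dir):
    """Unshard one layer at a time and flush it to disk, so peak memory stays
    around 1x model size (the shards) + the largest single layer."""
    os.makedirs(out_dir, exist_ok=True)
    weight_map = {}
    for i, name in enumerate(list_layers(shards)):
        tensors = recompose_layer(shards, name)          # fp32 tensors for this layer only
        fname = f"model-{i:05d}.safetensors"
        save_file(tensors, os.path.join(out_dir, fname))
        weight_map.update({key: fname for key in tensors})
        del tensors                                      # free the layer before the next one

    # HF-style index so the sharded safetensors output loads as one checkpoint
    with open(os.path.join(out_dir, "model.safetensors.index.json"), "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f)
```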