
[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads #6526

Open
stas00 opened this issue Sep 11, 2024 · 0 comments
Labels
enhancement New feature or request


stas00 (Collaborator) commented Sep 11, 2024

When https://github.com/microsoft/DeepSpeed/blob/c27483933d50a693fef9c48418d2664cf6a6a6f8/deepspeed/utils/zero_to_fp32.py was written 3 years ago, models were small and converted quickly. Now, with 70B+ models, the conversion can take hours.

The original script uses a single CPU core.

Here is a possible implementation algorithm:

The way I was thinking multiple cores could be utilized: load all shards into CPU memory, then fire off multiple threads, each re-composing a single layer. The user could specify how many cores to use; by default all cores would be used, so that n_threads == cores. I think the total memory usage here will still be 2x model size * dtype size, just like in the original script.
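A minimal sketch of this idea, with toy stand-ins for the real checkpoint loading (the actual script would read DeepSpeed ZeRO shard files; the shapes, names, and the simple concatenate-and-reshape recomposition here are illustrative assumptions, not the real partitioning logic):

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_all_shards(num_ranks, layer_shapes):
    # Stand-in for loading every rank's shard into CPU memory up front.
    # Here each rank holds a 1/num_ranks slice of every layer's flat params.
    shards = []
    for rank in range(num_ranks):
        shard = {
            name: np.full(int(np.prod(shape)) // num_ranks, float(rank),
                          dtype=np.float32)
            for name, shape in layer_shapes.items()
        }
        shards.append(shard)
    return shards

def recompose_layer(name, shape, shards):
    # Concatenate this layer's slices from all ranks, restore the shape.
    flat = np.concatenate([shard[name] for shard in shards])
    return name, flat.reshape(shape)

def zero_to_fp32_parallel(layer_shapes, shards, n_threads=None):
    # n_threads defaults to all cores, matching the proposal above.
    n_threads = n_threads or os.cpu_count()
    state_dict = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(recompose_layer, name, shape, shards)
                   for name, shape in layer_shapes.items()]
        for fut in futures:
            name, tensor = fut.result()
            state_dict[name] = tensor
    return state_dict

layer_shapes = {"layer0.weight": (4, 8), "layer1.weight": (8, 8)}
shards = load_all_shards(num_ranks=2, layer_shapes=layer_shapes)
sd = zero_to_fp32_parallel(layer_shapes, shards, n_threads=2)
print(sd["layer0.weight"].shape)  # → (4, 8)
```

Since all shards stay resident while the full tensors are assembled, peak memory matches the estimate above: roughly 2x model size * dtype size.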

Possible additional changes:

  • Using safetensors would be a bonus because each tensor could then be written separately; there would be no need to wait for the whole model to be unsharded before writing a single torch tensor. This could also become an option for low-RAM nodes, where each layer is unsharded sequentially and total memory usage will be 1x model size * dtype size + max layer size * dtype size, which for a large model would be a huge memory saving, at the cost of not parallelizing - or perhaps using just 1-2 threads, which would already speed things up.
  • Switching to the universal checkpoint API would be another bonus because the original script is very clunky and very difficult to understand/maintain.
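The low-RAM variant from the first bullet can be sketched as follows. A file-per-tensor write stands in for safetensors' per-tensor layout, and the shard/layer setup is the same toy assumption as above, not the real script's format:

```python
import os
import tempfile
import numpy as np

def unshard_sequentially(layer_shapes, shards, out_dir):
    # Unshard one layer at a time and write it out immediately, so peak
    # memory is ~ shards + one unsharded layer, not the whole fp32 model.
    for name, shape in layer_shapes.items():
        flat = np.concatenate([shard[name] for shard in shards])
        np.save(os.path.join(out_dir, name + ".npy"), flat.reshape(shape))
        del flat  # only one unsharded layer is ever live

layer_shapes = {"layer0.weight": (4, 8)}
shards = [{"layer0.weight": np.zeros(16, dtype=np.float32)}
          for _ in range(2)]
out_dir = tempfile.mkdtemp()
unshard_sequentially(layer_shapes, shards, out_dir)
print(os.listdir(out_dir))  # → ['layer0.weight.npy']
```

This trades parallelism for the 1x model size + max layer size memory bound described above; running a small thread pool over a few layers at a time would sit between the two extremes.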

cc: @tjruwase

@stas00 stas00 added the enhancement New feature or request label Sep 11, 2024
@stas00 stas00 changed the title [REQUEST] parallelize zero_to_fp32.py to use multiple cores [REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads Sep 11, 2024