Releases: microsoft/DeepSpeed
v0.13.3 Patch release
What's Changed
- Update version.txt after 0.13.2 release by @mrwyattii in #5119
- Stop tracking backward chain of broadcast (ZeRO3) by @tohtana in #5113
- [NPU]ZeRO-Infinity feature compatibility by @misstek in #5077
- BF16 optimizer: Improve device utilization by immediate grad update by @deepcharm in #4975
- removed if condition in `if collate_fn is None` by @bm-synth in #5107
- disable compile tests for torch<2.1 by @mrwyattii in #5121
- Update inference test model names by @mrwyattii in #5127
- Fix issue with zero-sized file after merging file on curriculum `map_reduce` by @bm-synth in #5106
- Update return codes in PyTest to properly error out if tests fail by @loadams in #5122
- add missing methods to MPS_Accelerator by @mrwyattii in #5134
- Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. by @bm-synth in #5108
- Fix broadcast deadlock for incomplete batches in data sample for data analysis by @bm-synth in #5117
- Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning by @bm-synth in #5118
- remove mandatory `index` key from output of `metric_function` in `DataAnalysis` map operation by @bm-synth in #5112
- tensorboard logging: avoid item() outside gas to improve performance by @nelyahu in #5135
- Check overflow on device without host synchronization for each tensor by @BacharL in #5115
- Update nv-inference torch version by @loadams in #5128
- Method `run_map_reduce` to fix errors when running `run_map` followed by `run_reduce` by @bm-synth in #5131
- Added missing `isinstance` check in PR 5112 by @bm-synth in #5142
- Fix UserWarning: The torch.cuda.*DtypeTensor constructors are no long… by @ShukantPal in #5018
- TestEmptyParameterGroup: replace fusedAdam with torch.optim.AdamW by @nelyahu in #5139
- Update deprecated HuggingFace function by @mrwyattii in #5144
- Pin to PyTest 8.0.0 by @loadams in #5163
- get_grad_norm_direct: fix a case of empty norm group by @nelyahu in #5148
- Distributed in-memory map-reduce for data analyzer by @bm-synth in #5129
- DeepSpeedZeroOptimizer_Stage3: remove cuda specific optimizer by @nelyahu in #5138
- MOE: Fix save checkpoint when TP > 1 by @mosheisland in #5157
- Fix gradient clipping by @tohtana in #5150
- Use ninja to speed up build by @jinzhen-lin in #5088
- Update flops profiler to handle attn and matmul by @KimmiShi in #4724
- Fix allreduce for BF16 and ZeRO0 by @tohtana in #5170
- Write multiple items to output file at once, in distributed data analyzer. by @bm-synth in #5169
- Fix typos in blogs/ by @jinyouzhi in #5172
- Inference V2 Human Eval by @lekurile in #4804
- Reduce ds_id name length by @jomayeri in #5176
- Switch cpu-inference workflow from --extra-index-url to --index-url by @loadams in #5182
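The curriculum-learning fix above (avoiding zero-sized microbatches for incomplete minibatches) comes down to deriving the microbatch schedule from the samples actually present. A minimal illustrative sketch, not DeepSpeed's actual code:

```python
def split_microbatches(samples, micro_batch_size):
    # Derive the microbatch count from the data actually present, so an
    # incomplete minibatch yields fewer microbatches rather than padding
    # the schedule with zero-sized ones.
    assert micro_batch_size > 0
    return [samples[i:i + micro_batch_size]
            for i in range(0, len(samples), micro_batch_size)]
```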
New Contributors
Full Changelog: v0.13.2...v0.13.3
v0.13.2 Patch release
What's Changed
- Update version.txt after 0.13.1 release by @mrwyattii in #5002
- Support `exclude_frozen_parameters` for `save_16bit_model` by @LZHgrla in #4999
- Allow nightly tests dispatch by @mrwyattii in #5014
- Enable hpz based on secondary tensor presence by @HeyangQin in #4906
- Enable workflow dispatch on all workflows by @loadams in #5016
- [minor] improve code quality and readability by @ByronHsu in #5011
- Update falcon fused type order by @Yejing-Lai in #5007
- Fix error report of DSElasticAgent._set_master_addr_port() by @RobinDong in #4985
- DS #4993 #662: autotune single node hostfile bugfix by @oushu1zhangxiangxuan1 in #4996
- [minor] Improve logging for multiprocesses by @ByronHsu in #5004
- deepspeed/launcher: add launcher_helper as each rank's start portal by @YizhouZ in #4699
- Graph capture support on HPU accelerators by @deepcharm in #5013
- launcher/launcher_helper.py: fix PMI name and add EnvironmentError by @YizhouZ in #5025
- Remove MI100 badge from landing page by @mrwyattii in #5036
- Remove coverage reports from workflows and fix for inference CI by @loadams in #5028
- Remove Megatron-DeepSpeed CI workflow by @mrwyattii in #5038
- Fix P40 CI failures by @mrwyattii in #5037
- Fix for nightly torch CI by @mrwyattii in #5039
- Fix nv-accelerate and nv-torch-latest-v100. by @loadams in #5035
- update inference pages to point to FastGen by @mrwyattii in #5029
- launcher_helper: enable fds passing by @YizhouZ in #5042
- Fix nv-torch-latest-cpu CI by @mrwyattii in #5045
- [NPU] Add NPU to support hybrid engine by @CurryRice233 in #4831
- MoE type hints by @ringohoffman in #5043
- [doc] update inference related docs from `mp_size` to `tensor_parallel` for TP by @yundai424 in #5048
- Fix broken model names in inference CI by @mrwyattii in #5053
- [NPU] Change log level to debug by @CurryRice233 in #5051
- Delay reduce-scatter for ZeRO3 leaf modules by @tohtana in #5008
- Optimize grad_norm calculations by reducing device/host dependency by @nelyahu in #4974
- load linear layer weight with given dtype by @polisettyvarma in #4044
- Update import for changes to latest diffusers by @mrwyattii in #5065
- adding hccl to init_distributed function description by @nelyahu in #5034
- [Zero++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` by @ByronHsu in #5056
- Make batch size documentation clearer by @segyges in #5072
- [doc/1-line change] default stage3_param_persistence_threshold is wrong in the doc by @ByronHsu in #5073
- Further refactor deepspeed.moe.utils + deepspeed.moe.layer type hints by @ringohoffman in #5060
- Fix verification for ZeRO3 leaf module by @tohtana in #5074
- Stop tracking backward chain of broadcast in initialization by @tohtana in #5075
- Update torch version for nv-torch-latest-cpu by @loadams in #5086
- Add backwards compatibility w/ older versions of diffusers (<0.25.0) by @lekurile in #5083
- Enable torch.compile with ZeRO (Experimental) by @tohtana in #4878
- Update nv-accelerate to latest torch by @loadams in #5040
- HPU Accelerator: fix supported_dtypes API by @nelyahu in #5094
- [NPU] replace 'cuda' with get_accelerator().device_name() by @minchao-sun in #5095
- optimize clip_grad_norm_ function by @mmhab in #4915
- [xs] fix ZEROPP convergence test by @yundai424 in #5061
- Switch hasattr check from compile to compiler by @loadams in #5096
- Split is_synchronized_device api to multiple apis by @BacharL in #5026
- 47% FastGen speedup for low workload - refactor allocator by @HeyangQin in #5090
- Support `exclude_frozen_parameters` for `zero_to_fp32.py` script by @andstor in #4979
- Fix alignment of optimizer states when loading by @tohtana in #5105
- Skip Triton import for AMD by @lekurile in #5110
- Add HIP conversion file outputs to .gitignore by @lekurile in #5111
- Remove optimizer step on initialization by @tohtana in #5104
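The ZeRO++ qgZ fallback above hinges on the divisibility condition quoted in the entry; it can be sketched as a simple predicate (the function name is illustrative, not a DeepSpeed API):

```python
def qgz_applicable(numel: int, global_world_size: int) -> bool:
    # qgZ's quantized gradient reduction needs the tensor's element count
    # to divide evenly by 2 * global_world_size; when it doesn't, ZeRO++
    # falls back to a plain reduce_scatter.
    return numel % (2 * global_world_size) == 0
```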
New Contributors
- @ByronHsu made their first contribution in #5011
- @RobinDong made their first contribution in #4985
- @oushu1zhangxiangxuan1 made their first contribution in #4996
- @yundai424 made their first contribution in #5048
- @segyges made their first contribution in #5072
- @andstor made their first contribution in #4979
Full Changelog: v0.13.1...v0.13.2
v0.13.1 Patch release
What's Changed
- Update version.txt after 0.13.0 release by @mrwyattii in #4982
- Update FastGen blog title by @arashb in #4983
- Fix the MoE-params gradient-scaling by @RezaYazdaniAminabadi in #4957
- fix some typo under blogs/ by @digger-yu in #4988
- Fix placeholder value in FastGen Blog by @mrwyattii in #5000
- fix for DS_ENV issue by @jeffra in #4992
- Delete unused --deepspeed_mpi command line argument by @ShukantPal in #4981
- Make installable without torch by @mrwyattii in #5001
- Implement some APIs of HPU accelerator by @mmhab in #4935
- Refactor the Qwen positional embedding config code by @ZonePG in #4955
Full Changelog: v0.13.0...v0.13.1
DeepSpeed v0.13.0
New Features
What's Changed
- Update version.txt after 0.12.6 release by @mrwyattii in #4850
- doc corrections by @goodship1 in #4861
- Fix exception handling in get_all_ranks_from_group() function by @HeyangQin in #4862
- deepspeed engine: fp16 support validation on init by @nelyahu in #4843
- Remove hooks on gradient accumulation on engine/optimizer destroy by @chiragjn in #4858
- optimize grad_norm calculation in stage3.py by @mmhab in #4436
- Fix f-string messages by @li-plus in #4865
- [NPU] Fix npu offload bug by @CurryRice233 in #4883
- Partition parameters: Minor refactoring of use_secondary_tensor condition by @deepcharm in #4868
- Pipeline: Add support to eval micro bs configuration by @nelyahu in #4859
- zero_to_fp32.py: Handle a case where shape doesn't have numel attr by @nelyahu in #4842
- Add support of Microsoft Phi-2 model to DeepSpeed-FastGen by @arashb in #4812
- Support cpu tensors without direct device invocation by @abhilash1910 in #3842
- add sharded loading for safetensors in AutoTP by @sywangyi in #4854
- [XPU] XPU accelerator support for Intel GPU device by @delock in #4547
- enable starcoder (kv_head=1) autotp by @Yejing-Lai in #4896
- Release overlap_comm & contiguous_gradients restrictions for ZeRO 1 by @li-plus in #4887
- [NPU]Add ZeRO-Infinity feature for NPU by @misstek in #4809
- fix num_kv_heads sharding in uneven autoTP for Falcon-40b by @Yejing-Lai in #4712
- Nvme offload checkpoint by @eisene in #4707
- Add WarmupCosineLR to Read the Docs by @dwyatte in #4916
- Add Habana Labs HPU accelerator support by @deepcharm in #4912
- Unit tests for MiCS by @zarzen in #4792
- Fix SD workflow to work with latest diffusers version by @lekurile in #4918
- [Fix] Fix cpu inference UT failure by @delock in #4430
- Add paths to run SD tests by @loadams in #4919
- Change PR/schedule triggers for CPU-inference by @loadams in #4924
- fix falcon-40b accuracy issue by @Yejing-Lai in #4895
- Refactor the positional embedding config code by @arashb in #4920
- Pin to triton 2.1.0 to fix issues with nv-inference by @loadams in #4929
- Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen by @ZonePG in #4913
- DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators by @nelyahu in #4833
- Fix confusing width in simd_load by @yzhblind in #4714
- Specify permissions for secrets.GITHUB_TOKEN by @mrwyattii in #4927
- Enable quantizer op on ROCm by @rraminen in #4114
- autoTP for Qwen by @inkcherry in #4902
- Allow specifying mii branch for nv-a6000 workflow by @mrwyattii in #4936
- Only run MII CI for inference changes by @mrwyattii in #4939
- InfV2 - remove generation config requirement by @mrwyattii in #4938
- Cache HF model list for inference tests by @mrwyattii in #4940
- Fix docs inconsistency on default value for `ignore_unused_parameters` by @loadams in #4949
- Fix bug in CI model caching by @mrwyattii in #4951
- fix uneven issue & add balance autotp by @Yejing-Lai in #4697
- Optimize preprocess for ragged batching by @tohtana in #4942
- Fix bug where ZeRO2 never uses the reduce method. by @CurryRice233 in #4946
- [docs] Add new autotp supported model in tutorial by @delock in #4960
- Add missing op_builder.hpu component for HPU accelerator by @nelyahu in #4963
- Stage_1_and_2.py: fix assert for reduce_scatter configurations combinations by @nelyahu in #4964
- [MiCS]Add the path to support sequence_data_parallel on MiCS by @ys950902 in #4926
- Update the DeepSpeed Phi-2 impl. to work with the HF latest changes by @arashb in #4950
- Prevent infinite recursion when DS_ACCELERATOR is set to cuda by @ShukantPal in #4962
- Fixes for training models with bf16 + freshly initialized optimizer via `load_module_only` by @haileyschoelkopf in #4141
- params partition for skip_init by @inkcherry in #4722
- Enhance query APIs for text generation by @tohtana in #4965
- Add API to set a module as a leaf node when recursively setting Z3 hooks by @tohtana in #4966
- Fix T5 and mistral model meta data error by @Yejing-Lai in #4958
- FastGen Jan 2024 blog by @mrwyattii in #4980
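The `load_module_only` fix above concerns restoring module weights from a checkpoint while leaving a freshly initialized optimizer untouched. A simplified sketch of that behavior, not DeepSpeed's actual loader:

```python
def load_checkpoint(checkpoint: dict, state: dict,
                    load_module_only: bool = False) -> dict:
    # Always restore module weights; when load_module_only is set, skip
    # the optimizer state so a freshly initialized optimizer (e.g. for
    # bf16 fine-tuning) keeps its clean state.
    state["module"] = checkpoint["module"]
    if not load_module_only:
        state["optimizer"] = checkpoint.get("optimizer")
    return state
```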
New Contributors
- @chiragjn made their first contribution in #4858
- @li-plus made their first contribution in #4865
- @misstek made their first contribution in #4809
- @dwyatte made their first contribution in #4916
- @ZonePG made their first contribution in #4913
- @yzhblind made their first contribution in #4714
- @ShukantPal made their first contribution in #4962
- @haileyschoelkopf made their first contribution in #4141
Full Changelog: v0.12.6...v0.13.0
v0.12.6: Patch release
What's Changed
- Update version.txt after 0.12.5 release by @mrwyattii in #4826
- Cache metadata for TP activations and grads by @BacharL in #4360
- Inference changes for incorporating meta loading checkpoint by @oelayan7 in #4692
- Update CODEOWNERS by @mrwyattii in #4838
- support baichuan model by @baodii in #4721
- inference engine: check if accelerator supports FP16 by @nelyahu in #4832
- Update zeropp.md by @goodship1 in #4835
- [NPU] load EXPORT_ENV based on different accelerators to support multi-node training on other devices by @minchao-sun in #4830
- Add cuda_accelerator.py to triggers for A6000 test by @mrwyattii in #4848
- Capture short kernel sequences to graph by @inkcherry in #4318
- Checkpointing: Avoid assigning tensor storage with different device by @deepcharm in #4836
- engine.py: remove unused _curr_save_path by @nelyahu in #4844
- Mixtral FastGen Support by @cmikeh2 in #4828
New Contributors
- @minchao-sun made their first contribution in #4830
Full Changelog: v0.12.5...v0.12.6
v0.12.5: Patch release
What's Changed
- Fix DS Stable Diffusion for latest diffusers version by @lekurile in #4770
- Resolve any '..' in the file paths using os.path.abspath() by @rraminen in #4709
- Update dockerfile with updated versions by @loadams in #4780
- Run workflows when they are edited by @loadams in #4779
- BF16_Optimizer: add support for bf16 grad acc by @nelyahu in #4713
- fix autoTP issue for mpt (trust_remote_code=True) by @sywangyi in #4787
- Fix Hybrid Engine metrics printing by @lekurile in #4789
- [BUG] partition_balanced return wrong result. by @zjjMaiMai in #4312
- improve the way to determine whether a variable is None by @RUAN-ZX in #4782
- [NPU] Add HcclBackend for 1-bit adam, 1-bit lamb, 0/1 adam by @RUAN-ZX in #4733
- Fix for stage3 when setting different communication data type by @BacharL in #4540
- Add support of Falcon models (7b, 40b, 180b) to DeepSpeed-FastGen by @arashb in #4790
- Switch paths-ignore to single quotes, update paths-ignore on nv-pre-compile-ops by @loadams in #4805
- fix for tests using torch<2.1 by @mrwyattii in #4818
- Universal Checkpoint for Sequence Parallelism by @samadejacobs in #4752
- Accelerate CI fix by @mrwyattii in #4819
- fix [BUG] 'DeepSpeedGPTInference' object has no attribute 'dtype' for… by @jxysoft in #4814
- Update broken link in docs by @mrwyattii in #4822
- Update imports from Transformers by @loadams in #4817
- Minor updates to CI workflows by @mrwyattii in #4823
- fix falcon model load from_config meta_data error by @baodii in #4783
- mv DeepSpeedEngine param_names dict init post _configure_distributed_model by @nelyahu in #4803
- Refactor launcher user arg parsing by @mrwyattii in #4824
- Fix 4649 by @Alienfeel in #4650
New Contributors
- @zjjMaiMai made their first contribution in #4312
- @jxysoft made their first contribution in #4814
- @baodii made their first contribution in #4783
- @Alienfeel made their first contribution in #4650
Full Changelog: v0.12.4...v0.12.5
v0.12.4: Patch release
What's Changed
- Update version.txt after 0.12.3 release by @mrwyattii in #4673
- [MII] catch error wrt HF version and Mistral by @jeffra in #4634
- [NPU] Add NPU support for unit test by @RUAN-ZX in #4569
- [op-builder] use unique exceptions for cuda issues by @jeffra in #4653
- Add stable diffusion unit test by @mrwyattii in #2496
- [CANN] Support cpu offload optimizer for Ascend NPU by @hipudding in #4568
- Inference Checkpoints in V2 by @cmikeh2 in #4664
- KV Cache Improved Flexibility by @cmikeh2 in #4668
- Fix for when prompt contains an odd num of apostrophes by @oelayan7 in #4660
- universal-ckp: support megatron-deepspeed llama model by @mosheisland in #4666
- Add new MII unit tests by @mrwyattii in #4693
- [Bug fix] WarmupCosineLR issues by @sbwww in #4688
- infV2 fix for OPT size variants by @mrwyattii in #4694
- Add get and set APIs for the ZeRO-3 partitioned parameters by @yiliu30 in #4681
- Remove unneeded dict reinit (fix for #4565) by @eisene in #4702
- Update flops profiler to recurse by @loadams in #4374
- Communication Optimization for Large-Scale Training by @RezaYazdaniAminabadi in #4695
- [docs] Intel inference blog by @jeffra in #4734
- use all_gather_into_tensor instead of all_gather by @taozhiwei in #4705
- Install `deepspeed-kernels` only on Linux by @aphedges in #4739
- Add nv-sd badge to README by @loadams in #4747
- Re-organize `.gitignore` file to be parsed properly by @aphedges in #4740
- fix mics run with offload++ by @GuanhuaWang in #4749
- Fix logger formatting for partitioning flags by @OAfzal in #4728
- fix: to solve #4726 by @RUAN-ZX in #4727
- Add safetensors support by @jihnenglin in #4659
New Contributors
- @RUAN-ZX made their first contribution in #4569
- @oelayan7 made their first contribution in #4660
- @sbwww made their first contribution in #4688
- @yiliu30 made their first contribution in #4681
- @eisene made their first contribution in #4702
- @taozhiwei made their first contribution in #4705
- @OAfzal made their first contribution in #4728
- @jihnenglin made their first contribution in #4659
Full Changelog: v0.12.3...v0.12.4
v0.12.3: Patch release
New Bug Fixes
- Stable Diffusion now supported with latest Torch, diffusers, and Triton versions.
What's Changed
- Update version.txt after 0.12.2 release by @mrwyattii in #4617
- Fix figure in FlexGen blog by @tohtana in #4624
- Fix figure of llama2 13B in DS-FlexGen blog by @tohtana in #4625
- Fix config format by @xu-song in #4594
- Guanhua/partial offload rebase v2 (#590) by @GuanhuaWang in #4636
- offload++ blog (#623) by @GuanhuaWang in #4637
- Update README in offloadpp blog by @GuanhuaWang in #4641
- [docs] update news items by @jeffra in #4640
- DeepSpeed-FastGen Chinese Blog by @HeyangQin in #4642
- Fix issues with torch cpu builds by @loadams in #4639
- Isolate src code and testing for DeepSpeed-FastGen by @cmikeh2 in #4610
- Add Japanese blog for DeepSpeed-FastGen by @tohtana in #4651
- Fix for MII unit tests by @mrwyattii in #4652
- Enhance the robustness of `module_state_dict` by @LZHgrla in #4587
- Enable ZeRO3 allgather for multiple dtypes by @tohtana in #4647
- add option to disable pipeline partitioning by @nelyahu in #4322
- Added HIP_PLATFORM_AMD=1 for non JIT build by @rraminen in #4585
- Fix rope_theta arg for diffusers_attention by @lekurile in #4656
- `tl.dot(a, b, trans_b=True)` is not supported by triton 2.0+, updating this api by @bmedishe in #4541
- Update ds-chat workflow to work w/ deepspeed-chat install by @lekurile in #4598
- Diffusers attention script update triton2.1 by @bmedishe in #4573
- Fix the openfold training. by @cctry in #4657
- Universal ckp fixes by @mosheisland in #4588
- Update .gitignore [Adding comments, Improved documentation] by @Nadav23AnT in #4631
- Update lr_schedules.py by @CoinCheung in #4563
- Fix UNET and VAE implementations for new diffusers version by @lekurile in #4663
- fix num_kv_heads sharding in autoTP for the new in-repo Falcon-40B by @dc3671 in #4654
New Contributors
- @xu-song made their first contribution in #4594
- @LZHgrla made their first contribution in #4587
- @mosheisland made their first contribution in #4588
- @Nadav23AnT made their first contribution in #4631
- @CoinCheung made their first contribution in #4563
Full Changelog: v0.12.2...v0.12.3
v0.12.2
What's Changed
- Quick bug fix direct to `master` to ensure mismatched cuda environments are shown to the user 4f7dd72
- Update version.txt after 0.12.1 release by @mrwyattii in #4615
Full Changelog: v0.12.1...v0.12.2
v0.12.1: Patch release
What's Changed
- Update version.txt after 0.12.0 release by @mrwyattii in #4611
- Add number for latency comparison by @tohtana in #4612
- Update minor CUDA version compatibility. by @cmikeh2 in #4613
Full Changelog: v0.12.0...v0.12.1