Issues: microsoft/DeepSpeed
Issues list
[TRACKER] Customer support related PR tracker for Intel devices
enhancement
New feature or request
#6556
opened Sep 20, 2024 by
delock
20 of 23 tasks
[BUG] ACL stream synchronize failed, error code 507015
bug
Something isn't working
compression
#6555
opened Sep 19, 2024 by
janelu9
[BUG] CUDA error: no kernel image is available for execution on the device
bug
Something isn't working
training
#6549
opened Sep 17, 2024 by
getao
[REQUEST] A minimal example to load universal checkpoint
enhancement
New feature or request
#6548
opened Sep 17, 2024 by
hsl89
[BUG] Expert gradient scaling problem with ZeRO optimizer
bug
Something isn't working
training
#6545
opened Sep 17, 2024 by
wyooyw
RecursionError: maximum recursion depth exceeded while calling a Python object
bug
Something isn't working
training
#6534
opened Sep 13, 2024 by
Swordfish1990
[REQUEST] dynamic batch size with gradient accumulate
enhancement
New feature or request
#6533
opened Sep 13, 2024 by
Xiang-cd
[REQUEST] parallelize zero_to_fp32.py to use multiple cpu-cores and threads
enhancement
New feature or request
#6526
opened Sep 11, 2024 by
stas00
[BUG] Distributed Training randomly stuck in trainings loop
bug
Something isn't working
training
#6524
opened Sep 11, 2024 by
raeudigerRaeffi
[BUG] error :past_key, past_value = layer_past,how to solve this ?
bug
Something isn't working
deepspeed-chat
Related to DeepSpeed-Chat
#6522
opened Sep 11, 2024 by
lovychen
[BUG] RuntimeError: Error building extension 'inference_core_ops'
bug
Something isn't working
inference
#6519
opened Sep 10, 2024 by
Chetan3200
[REQUEST] ZeRO3 doc - support for wrapping model sub-components seperately for training
enhancement
New feature or request
#6505
opened Sep 8, 2024 by
orrzohar
[BUG] Universal checkpointing doesn't work when changing model parallel size (pp and dp change are ok)
bug
Something isn't working
compression
#6503
opened Sep 8, 2024 by
exnx
[BUG] Config mesh_device None
bug
Something isn't working
training
#6501
opened Sep 6, 2024 by
GuanhuaWang
[BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i"
bug
Something isn't working
training
#6497
opened Sep 5, 2024 by
sirus20x6
[BUG] GatheredParameters resulting in NCCL mismatched collectives error
bug
Something isn't working
training
#6492
opened Sep 5, 2024 by
jeromeku
Deep Speed Optimizer index out of range during Training
bug
Something isn't working
training
#6482
opened Sep 4, 2024 by
manitadayon
[REQUEST] How to use Flops Profiler to test model.generate(), with customized forward context manager
enhancement
New feature or request
#6475
opened Sep 2, 2024 by
lemyx
[REQUEST] MiCS vs Zero++ hpZ for Hybrid FSDP
enhancement
New feature or request
#6467
opened Aug 31, 2024 by
jeromeku
[REQUEST] Replace reduce in ZERO 1/2/3 with reduce_scatter
enhancement
New feature or request
#6463
opened Aug 30, 2024 by
efsotr
[BUG] setting PYTORCH_CUDA_ALLOC_CONF in .deepspeed_env raise return code = -11
bug
Something isn't working
training
#6454
opened Aug 28, 2024 by
nomadlx