[Bug] Training XTTSv2 from Coqui TTS leads to weird training lags with DDP #145

Open
NikitaKononov opened this issue Jul 1, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@NikitaKononov

Describe the bug

Hello, training XTTSv2 from Coqui TTS leads to weird training lags when using DDP.
Hardware: 6x RTX A6000 GPUs and 512 GB RAM.

Here is the GPU load monitoring graph. Purple is gpu0, green is gpu1 (all the other GPUs behave like gpu1).

[image: GPU load graph]

With 4 GPUs the situation remains the same.

I think there's some kind of error in Trainer.

To Reproduce

python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5

Expected behavior

No response

Logs

No response

Environment

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0 Off |                  Off |
| 46%   70C    P2             229W / 300W |  32382MiB / 49140MiB |     91%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:25:00.0 Off |                  Off |
| 42%   68C    P2             246W / 300W |  27696MiB / 49140MiB |     77%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:41:00.0 Off |                  Off |
| 38%   67C    P2             256W / 300W |  27640MiB / 49140MiB |     63%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:81:00.0 Off |                  Off |
| 39%   67C    P2             245W / 300W |  27640MiB / 49140MiB |     67%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000               On  | 00000000:A1:00.0 Off |                  Off |
| 46%   70C    P2             239W / 300W |  27620MiB / 49140MiB |     66%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000               On  | 00000000:C2:00.0 Off |                  Off |
| 30%   31C    P8              17W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2517964      C   ...onov/anaconda3/envs/xtts/bin/python    32374MiB |
|    1   N/A  N/A   2516039      C   python3                                   27688MiB |
|    2   N/A  N/A   2516040      C   python3                                   27632MiB |
|    3   N/A  N/A   2516041      C   python3                                   27632MiB |
|    4   N/A  N/A   2516042      C   python3                                   27612MiB |
+---------------------------------------------------------------------------------------+
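
To put numbers on the per-GPU imbalance instead of relying on a single snapshot or a graph, the utilization can be polled with `nvidia-smi` in query mode and parsed. The snippet below is only a sketch: the helper names (`parse_gpu_csv`, `sample_gpus`, `idle_gpus`) and the 5% idle threshold are made up for illustration and are not part of Trainer or Coqui TTS. It assumes every queried field comes back as a plain integer.

```python
import subprocess

# Query mode prints one CSV row per GPU, e.g. "0, 91, 32382".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, utilization, memory' CSV rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def idle_gpus(rows, util_threshold=5):
    """Return indices of GPUs at or below the (hypothetical) idle threshold."""
    return [row["gpu"] for row in rows if row["util_pct"] <= util_threshold]
```

Calling `sample_gpus()` about once per second during training and logging the rows with a timestamp would show whether the dips on gpu1 and the other GPUs line up with specific training steps.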

Additional context

No response

@NikitaKononov NikitaKononov added the bug Something isn't working label Jul 1, 2024
@NikitaKononov
Author

I tried num_workers=0, num_workers>0, MP_THREADS_NUM and so on; nothing helps.
The machine has plenty of RAM and shared memory.
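
Since changing the dataloader settings made no difference, it might help to measure on each rank how much of every iteration is spent waiting for the next batch versus running the training step. The following is a rough, framework-agnostic sketch, not Trainer's API: `timed_steps` and `summarize` are hypothetical helpers, `step_fn` stands for whatever runs forward/backward/optimizer for one batch, and `sync_fn` would typically be `torch.cuda.synchronize` so that asynchronous GPU work is included in the timing.

```python
import time


def timed_steps(loader, step_fn, sync_fn=None):
    """Yield (data_wait_s, step_s) for each batch pulled from `loader`.

    loader  : any iterable of batches (for example a DataLoader)
    step_fn : callable that runs one training step on a batch
    sync_fn : optional callable that blocks until device work has finished
    """
    iterator = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is dataloader wait
        except StopIteration:
            return
        t1 = time.perf_counter()
        step_fn(batch)
        if sync_fn is not None:
            sync_fn()  # wait for queued GPU work before stopping the clock
        t2 = time.perf_counter()
        yield t1 - t0, t2 - t1


def summarize(timings):
    """Aggregate a list of (data_wait_s, step_s) pairs."""
    data = sum(d for d, _ in timings)
    step = sum(s for _, s in timings)
    total = data + step
    return {
        "data_wait_s": data,
        "step_s": step,
        "data_fraction": data / total if total else 0.0,
    }
```

If `data_fraction` stays small on every rank while the step time spikes periodically, the lag is more likely in the step itself. Under DDP the gradient all-reduce runs during the backward pass, so a rank waiting for a slower peer would be expected to show that wait as step time rather than data wait.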
