[Bug] Unable to run distributed training using TTS recipe for yourtts #113

Open
Draegon366 opened this issue Jun 24, 2023 · 1 comment
Labels: bug (Something isn't working)

@Draegon366

Describe the bug

I've been trying to train YourTTS on a Google Compute Engine instance, but it doesn't work when launched with trainer.distribute.
Previously, when I could run it, it would get to the same point in initialization every time, then one of the training workers would crash and the others would freeze.
I'm running largely unchanged code from the provided recipe; I've only reduced the worker count to fit the cloud instance and added my own dataset.
Without distributed training it trains fine until it runs out of VRAM, and training locally on a 3090 works fine, albeit slowly.

TTS is also installed at the latest version; I'm not sure why collect_env_info.py didn't report it in the environment info below.

To Reproduce

  1. Run CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py on the Google Compute Engine instance
  2. Wait several seconds
  3. Observe the error below.

Expected behavior

Runs the training script with processing split between the GPUs.

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list

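The final TypeError is what plain Python raises when a list is indexed with another list, so each idx reaching the dataset's __getitem__ here is apparently a whole batch of indices rather than a single integer. A minimal sketch of that failure mode (the sample names are made up):

    # self.samples in the dataset is a plain Python list.
    samples = ["utt_0001.wav", "utt_0002.wav", "utt_0003.wav"]

    # Normal integer indexing, as __getitem__ expects, works fine.
    item = samples[0]              # "utt_0001.wav"

    # If the sampler hands the dataset a list of indices (a batch) instead,
    # indexing a list with a list fails exactly as in the log above.
    batch_indices = [0, 2]
    item = samples[batch_indices]  # TypeError: list indices must be integers or slices, not list
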
Environment

{       
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.27",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "",
        "python": "3.10.10",
        "version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
    }
}

Additional context

No response

Draegon366 added the bug (Something isn't working) label on Jun 24, 2023
@NikitaKononov

Hello, did you find a way to deal with this?
I'm facing the same problem.
They chose, for some reason, not to use spawn for torch DDP; maybe that is the problem?
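For reference, a minimal sketch of the spawn-based DDP launch that comment refers to, assuming a plain PyTorch setup (the worker body and port are placeholders; whether trainer.distribute itself forks or spawns its processes is exactly the open question):

    import os

    import torch.distributed as dist
    import torch.multiprocessing as mp


    def worker(rank: int, world_size: int) -> None:
        # One process per GPU, created by mp.spawn rather than forked by a launcher script.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        # ... build the model and DistributedSampler-backed loaders, then train here ...
        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = 4  # matches CUDA_VISIBLE_DEVICES="0,1,2,3"
        mp.spawn(worker, args=(world_size,), nprocs=world_size)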
