[Bug] Num. instances discarded samples: 252. VITS finetuning #3347

Closed
nicholasguimaraes opened this issue Dec 1, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@nicholasguimaraes

Describe the bug

Hello, I'm fine-tuning the VITS model on my dataset of 979 audio samples.

I have followed this tutorial (https://tts.readthedocs.io/en/latest/formatting_your_dataset.html) on how to format my dataset and it looks just fine.

I am able to start training, but right before the first epoch I see this log:

DataLoader initialization
| > Tokenizer:
| > add_blank: True
| > use_eos_bos: False
| > use_phonemes: False
| > Number of instances : 979
| > Preprocessing samples
| > Max text length: 94
| > Min text length: 15
| > Avg text length: 50.786795048143055
|
| > Max audio length: 223959.0
| > Min audio length: 46143.0
| > Avg audio length: 144398.36176066025
| > Num. instances discarded samples: 252
| > Batch group size: 0.

My first thought was to check the min/max permitted audio sample length in my config.json file.

My desired minimum length is 16,000 (about 1 second), but editing it in the config.json file does not seem to have any effect: the log still prints Min audio length: 46143.0.

What is even stranger is that 46143.0 is about 2 seconds, and I only have 77 audio samples shorter than 2 seconds, so I'm not sure what causes 252 samples to be discarded.

Any insights are very welcome!
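
One way to cross-check the discard count independently of the trainer is to read each file's real frame count and compare it against the configured window. The following is only a minimal sketch: the function names, the `*.wav` glob, and the inclusive comparison are my assumptions, not the trainer's actual filtering code, and the standard-library `wave` module only reads PCM WAV files.

```python
import wave
from pathlib import Path


def wav_num_frames(path):
    """Return the frame count stored in a PCM WAV file's header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes()


def split_by_length(lengths, min_len, max_len):
    """Partition a {name: length} mapping into (kept, discarded) name lists.

    Assumption: the window is inclusive on both ends; the real trainer
    may compare differently.
    """
    kept, discarded = [], []
    for name, length in lengths.items():
        if min_len <= length <= max_len:
            kept.append(name)
        else:
            discarded.append(name)
    return kept, discarded


def scan_dataset(wav_dir, min_len, max_len):
    """Measure every *.wav file in wav_dir and split it by the length window."""
    lengths = {p.name: wav_num_frames(p) for p in sorted(Path(wav_dir).glob("*.wav"))}
    return split_by_length(lengths, min_len, max_len)
```

Calling `scan_dataset` on the dataset's audio folder with the same minimum and maximum values as in config.json gives a discard list to compare against the 252 reported in the log. A large mismatch would suggest the trainer is measuring lengths differently from the file headers.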

To Reproduce

To reproduce the error you need to install coqui-ai TTS and set up a custom dataset for finetuning (ljspeech format).

After that run: python train_tts.py --config_path tts_models--pt--cv--vits/config.json --restore_path tts_models--pt--cv--vits/model_file.pth.tar

Expected behavior

No response

Logs

No response

Environment

TTS 0.20.6
torch gpu 2.1.0
cuda 12.0
nvidia driver 527.41

Additional context

No response

@nicholasguimaraes nicholasguimaraes added the bug Something isn't working label Dec 1, 2023
@Edresson Edresson self-assigned this Dec 6, 2023
@Edresson
Contributor

Edresson commented Dec 6, 2023

Hi @nicholasguimaraes,

What is the audio format that you are using for training?

I think the issue is that the current audio length computation only supports 16-bit WAV files. For any other audio format it computes the length incorrectly. I fixed this in PR #3092.

@erogol I think we should merge #3092 as soon as possible to avoid issues like this one.
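
To illustrate why a 16-bit assumption can skew lengths, here is a small sketch that writes the same number of silent frames as a 16-bit and a 32-bit PCM WAV, then compares a size-based estimate with the frame count in the header. It assumes the length is estimated as file size divided by 2 bytes per sample, which is my reading of the comment above rather than a quote of the library's code; the helper names and the 22050 Hz sample rate are hypothetical.

```python
import os
import tempfile
import wave


def write_silence(path, n_frames, sample_width, sample_rate=22050):
    """Write a mono PCM WAV of silence with the given bytes per sample."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00" * (n_frames * sample_width))


def length_from_file_size(path):
    """Size-based estimate that assumes 2 bytes per sample (16-bit mono)."""
    return os.path.getsize(path) / 2


def length_from_header(path):
    """Frame count read from the WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes()


def compare(n_frames=22050):
    """Return {bit_depth: (size_based_estimate, header_frame_count)}."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for bits, width in ((16, 2), (32, 4)):
            path = os.path.join(tmp, f"silence_{bits}bit.wav")
            write_silence(path, n_frames, width)
            results[bits] = (length_from_file_size(path), length_from_header(path))
    return results


if __name__ == "__main__":
    for bits, (estimated, actual) in compare().items():
        print(f"{bits}-bit: estimated={estimated:.0f} actual={actual}")
```

Under this assumption the 16-bit estimate is off only by the header bytes, while the 32-bit estimate comes out at roughly twice the true frame count, which could push otherwise valid files past a maximum-length threshold.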
