
Adds multispeaker support to inference_onnx with VITS, makes onnx inference faster #2725

Closed
wants to merge 1 commit

Conversation


@Iamgoofball Iamgoofball commented Jun 30, 2023

Adjusts the provider choice for CUDAExecutionProvider based on microsoft/onnxruntime#12880 (comment) and #2563 (comment).

Fixes ONNX inference to support multispeaker models.
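
For reference, a rough sketch of what the two changes amount to when driving the exported VITS graph directly with onnxruntime. This is not the actual patch: the model path, the input names ("input", "input_lengths", "scales", "sid"), the scale values, and the specific CUDA provider option are assumptions based on the linked comments.

# Hedged sketch, not the PR's code: a CUDA provider configured per the linked
# onnxruntime comment, plus a speaker id passed through for a multispeaker model.
import numpy as np
import onnxruntime as ort

providers = [
    # Assumption: skipping the exhaustive cuDNN algorithm search is the kind of
    # provider tweak the linked onnxruntime#12880 comment suggests for faster runs.
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}),
    "CPUExecutionProvider",  # fallback when CUDA is unavailable
]
sess = ort.InferenceSession("coqui_vits.onnx", providers=providers)  # placeholder path

def run_vits(text_inputs, speaker_id=None):
    # Input names and scale values are assumptions about the exported VITS graph.
    onnx_inputs = {
        "input": text_inputs.astype(np.int64),
        "input_lengths": np.array([text_inputs.shape[1]], dtype=np.int64),
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    }
    if speaker_id is not None:
        # Multispeaker path: the speaker id rides along as an extra int64 input.
        onnx_inputs["sid"] = np.array([speaker_id], dtype=np.int64)
    return sess.run(None, onnx_inputs)[0]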

Benched it against a personal multispeaker model with the following setup:

# Benchmark harness; vits, config, and save_wav come from the surrounding Coqui TTS setup.
import random
import statistics
import time

import numpy as np

time_list = []
text_inputs_list = []
num_speakers = range(265)
print("Beginning 500 32 buffalo random-speaker inference test with text inputs calc'd each time")
for i in range(500):
    starttime = time.time()
    text = "Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo."
    # Tokenize the text each iteration so the tokenizer cost is measured separately.
    text_inputs = np.asarray(
        vits.tokenizer.text_to_ids(text, language="en"),
        dtype=np.int64,
    )[None, :]
    text_inputs_list.append(time.time() - starttime)
    starttime = time.time()
    # Pick a random speaker id out of the model's 265 speakers for each clip.
    audio = vits.inference_onnx(text_inputs, speaker_id=random.choice(num_speakers))
    time_list.append(time.time() - starttime)
    save_wav(wav=audio[0], path="./buffalo_samples/coqui_vits_" + str(i) + ".wav", sample_rate=config.audio.sample_rate)
print("Mean text_inputs time: " + str(statistics.mean(text_inputs_list)))
print("Mean inference time: " + str(statistics.mean(time_list)))
print("Total inference time: " + str(sum(time_list)))

We were able to generate 2 hours and 15 minutes of roughly 15-20 second clips in roughly 33 seconds of inference time on an RTX 4080 using this setup.

Beginning 500 32 buffalo random-speaker inference test with text inputs calc'd each time
Mean text_inputs time: 0.03477813768386841
Mean inference time: 0.06550791311264038
Total inference time: 32.75395655632019

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@erogol
Member

erogol commented Jul 4, 2023

thanks for the PR @Iamgoofball

Do you plan to sign the CLA?

@Iamgoofball
Author

Iamgoofball commented Jul 4, 2023

I don't like putting my real name on things on this account, sorry. If you wanna mirror the PR though, go for it; it's like 2 lines anyway, and one of them is just borrowed from the comment linked in the OP about the ONNX params. I'll close this one when you do.

@jdlproenca

jdlproenca commented Jul 5, 2023

Thank you @Iamgoofball!
Although it might be a rare use case, I have a multi-speaker model with only 1 speaker, and I had to make one extra change in export_onnx from if self.num_speakers > 1: to if self.num_speakers > 0: here:

if self.num_speakers > 1:

Does it make sense in general? I see more if self.num_speakers > 0: checks in vits.py.
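
A minimal sketch of what that guard change looks like; the code around the conditional is a placeholder for the export logic rather than the actual vits.py source:

# Placeholder sketch of the export_onnx guard, not the real vits.py code.
# With "> 1", a multi-speaker model that happens to contain a single speaker never
# gets the speaker-id input added to the exported graph; "> 0" matches the other
# multi-speaker checks in vits.py.
if self.num_speakers > 0:  # previously: if self.num_speakers > 1:
    onnx_input_names.append("sid")  # placeholder for however the export wires in the speaker id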

@erogol
Member

erogol commented Jul 6, 2023

Closing this for #2743.

@erogol erogol closed this Jul 6, 2023
@Nanayeb34

What version of TTS are you using, @Iamgoofball?
