
Adds multispeaker support to inference_onnx with VITS, makes onnx inference faster #2725

Closed
wants to merge 1 commit

Conversation


@Iamgoofball Iamgoofball commented Jun 30, 2023

Adjusts the provider choice for CUDAExecutionProvider based on microsoft/onnxruntime#12880 (comment) and #2563 (comment).

Fixes ONNX inference to support multispeaker models.
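
For reference, a rough sketch of what the two changes amount to when driving the exported VITS graph directly with onnxruntime. This is not the actual patch: the model path, the input names ("input", "input_lengths", "scales", "sid"), the scale values, and the specific CUDA provider option are assumptions based on the linked comments.

# Hedged sketch, not the PR's code: a CUDA provider configured per the linked
# onnxruntime comment, plus a speaker id passed through for a multispeaker model.
import numpy as np
import onnxruntime as ort

providers = [
    # Assumption: skipping the exhaustive cuDNN algorithm search is the kind of
    # provider tweak the linked onnxruntime#12880 comment suggests for faster runs.
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}),
    "CPUExecutionProvider",  # fallback when CUDA is unavailable
]
sess = ort.InferenceSession("coqui_vits.onnx", providers=providers)  # placeholder path

def run_vits(text_inputs, speaker_id=None):
    # Input names and scale values are assumptions about the exported VITS graph.
    onnx_inputs = {
        "input": text_inputs.astype(np.int64),
        "input_lengths": np.array([text_inputs.shape[1]], dtype=np.int64),
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    }
    if speaker_id is not None:
        # Multispeaker path: the speaker id rides along as an extra int64 input.
        onnx_inputs["sid"] = np.array([speaker_id], dtype=np.int64)
    return sess.run(None, onnx_inputs)[0]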

Benched it against a personal multispeaker model with the following setup:

# Benchmark harness; vits, config, and save_wav come from the surrounding Coqui TTS setup.
import random
import statistics
import time

import numpy as np

time_list = []
text_inputs_list = []
num_speakers = range(265)
print("Beginning 500 32 buffalo random-speaker inference test with text inputs calc'd each time")
for i in range(500):
    starttime = time.time()
    text = "Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo Buffalo."
    # Tokenize the text each iteration so the tokenizer cost is measured separately.
    text_inputs = np.asarray(
        vits.tokenizer.text_to_ids(text, language="en"),
        dtype=np.int64,
    )[None, :]
    text_inputs_list.append(time.time() - starttime)
    starttime = time.time()
    # Pick a random speaker id out of the model's 265 speakers for each clip.
    audio = vits.inference_onnx(text_inputs, speaker_id=random.choice(num_speakers))
    time_list.append(time.time() - starttime)
    save_wav(wav=audio[0], path="./buffalo_samples/coqui_vits_" + str(i) + ".wav", sample_rate=config.audio.sample_rate)
print("Mean text_inputs time: " + str(statistics.mean(text_inputs_list)))
print("Mean inference time: " + str(statistics.mean(time_list)))
print("Total inference time: " + str(sum(time_list)))

We were able to generate 2 hours and 15 minutes of roughly 15-20 second clips in roughly 33 seconds of inference time on an RTX 4080 using this setup.

Beginning 500 32 buffalo random-speaker inference test with text inputs calc'd each time
Mean text_inputs time: 0.03477813768386841
Mean inference time: 0.06550791311264038
Total inference time: 32.75395655632019

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@erogol
Member

erogol commented Jul 4, 2023

thanks for the PR @Iamgoofball

Do you plan to sign the CLA?

@Iamgoofball
Author

Iamgoofball commented Jul 4, 2023

I don't like putting my real name on things on this account, sorry. If you wanna mirror the PR though, go for it; it's like 2 lines anyway, and one of them is just borrowed from the comment linked in the OP about the ONNX params. I'll close this one when you do.

@jdlproenca

jdlproenca commented Jul 5, 2023

Thank you @Iamgoofball!
Although it might be a rare use case, I have a multi-speaker model with only 1 speaker, and I had to make one extra change in export_onnx from if self.num_speakers > 1: to if self.num_speakers > 0: here:

if self.num_speakers > 1:

Does it make sense in general? I see more if self.num_speakers > 0: checks in vits.py.
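
A minimal sketch of what that guard change looks like; the code around the conditional is a placeholder for the export logic rather than the actual vits.py source:

# Placeholder sketch of the export_onnx guard, not the real vits.py code.
# With "> 1", a multi-speaker model that happens to contain a single speaker never
# gets the speaker-id input added to the exported graph; "> 0" matches the other
# multi-speaker checks in vits.py.
if self.num_speakers > 0:  # previously: if self.num_speakers > 1:
    onnx_input_names.append("sid")  # placeholder for however the export wires in the speaker id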

@erogol
Member

erogol commented Jul 6, 2023

Closing this for #2743.

@erogol erogol closed this Jul 6, 2023
@Nanayeb34

What version of TTS are you using, @Iamgoofball?
