
Questions regarding architecture #159

Open
karandua2016 opened this issue May 23, 2024 · 1 comment

Comments


karandua2016 commented May 23, 2024

Hi team,

First of all, great job with MetaVoice. Everything in the repository works as expected.

I went through the code to understand the 4-stage inference and correlate it with the documentation, and I have a few questions about the choice of architecture/models. Please excuse me if these questions are naïve, as I am new to speech synthesis.

  1. You mention that you use GPT for the first stage. Is that GPT pretrained from scratch, or is it fine-tuned on top of a publicly available GPT checkpoint? The same question applies to the second-stage model.
  2. Why can't all 8 EnCodec hierarchies be predicted together? Why is a second-stage model needed?
  3. In the third stage, why is MBD (multi-band diffusion) used when EnCodec's own decoder can convert those EnCodec tokens back to a waveform? Did MBD turn out to be better in your experiments?
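For context on question 2, here is a toy numpy sketch of residual vector quantization (RVQ), the mechanism behind EnCodec's codebook hierarchies. All names, codebook sizes, and dimensions below are invented for illustration and are not EnCodec's real configuration:

```python
import numpy as np

# Toy residual vector quantization (RVQ): each quantizer encodes the
# residual left by the previous one, so hierarchy 1 is coarse and the
# later hierarchies are successive refinements.
rng = np.random.default_rng(0)
n_quantizers = 8   # illustrative; EnCodec's count depends on bandwidth
codebook_size = 16
dim = 4

codebooks = []
for q in range(n_quantizers):
    cb = rng.normal(size=(codebook_size, dim)) * (0.5 ** q)
    cb[0] = 0.0  # a "no-op" entry, so a stage can never worsen the residual
    codebooks.append(cb)

def rvq_encode(x, codebooks):
    """Return one code index per quantizer; each stage quantizes the residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the chosen entry from each codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
err_coarse = np.linalg.norm(x - rvq_decode(codes[:1], codebooks[:1]))
err_full = np.linalg.norm(x - rvq_decode(codes, codebooks))
print(err_coarse, err_full)  # the full stack refines the coarse reconstruction
```

Because each stage only quantizes what the previous stages left over, the first hierarchy carries most of the signal and the later ones are refinements, which is presumably why the coarse hierarchies are predicted first and a separate second-stage model fills in the rest.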

Thanks for the amazing work.
