Cantonese output appears when going from Japanese to Chinese in cross-lingual mode #385

Open
liujiaqi7998 opened this issue Sep 12, 2024 · 6 comments

@liujiaqi7998

Describe the bug

In cross-lingual mode, generating Chinese speech from a Japanese reference voice sometimes produces Cantonese output.

To reproduce

  1. Collect clean, voice-only Japanese recordings to use as reference samples for cross-lingual synthesis.
  2. Build a map from each reference recording to the Chinese text to be generated.
  3. Run the conversion with the following code:

         # 目标输出文字 holds the target Chinese text for this sample
         tts_text = "<|zh|>" + 目标输出文字
         prompt_speech_16k = load_wav(person_voice_file, prompt_sr)
         for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
             torchaudio.save(chinese_person_voice_file, j['tts_speech'], 22050)

  4. The output is a mix of Mandarin and Cantonese.

Expected behavior

Dataset: 433 original audio clips with corresponding pre-generated Chinese text. The clips average under 3 seconds, and each text is about 5 characters long.
Result: even with the "<|zh|>" tag added, more than 50% of the outputs still contain Cantonese.
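
For readers who want to reproduce the batch run, the map table from step 2 could be expanded into per-sample jobs along these lines. This is only a sketch: `build_jobs`, `voice_to_text`, and the file names are hypothetical and not part of CosyVoice; only the `<|zh|>` prefix comes from the report, and the actual synthesis call is left as a comment.

```python
from pathlib import Path

LANG_TAG = "<|zh|>"  # language tag used in the report


def build_jobs(voice_to_text, out_dir):
    """Turn a {reference_wav: target_text} map into synthesis jobs.

    Each job is (reference_wav, tagged_text, output_wav). The names here are
    illustrative and not part of CosyVoice.
    """
    jobs = []
    for ref_wav, text in voice_to_text.items():
        out_wav = Path(out_dir) / (Path(ref_wav).stem + "_zh.wav")
        jobs.append((ref_wav, LANG_TAG + text, str(out_wav)))
    return jobs


# Hypothetical map table: Japanese reference clip -> Chinese text to synthesize.
voice_to_text = {
    "voices/a001.wav": "你好",
    "voices/a002.wav": "谢谢你",
}
jobs = build_jobs(voice_to_text, "out")
for ref_wav, tts_text, out_wav in jobs:
    # In the real pipeline each job would go through the snippet from the
    # report: load_wav(ref_wav, prompt_sr), then
    # cosyvoice.inference_cross_lingual(tts_text, prompt, stream=False).
    print(ref_wav, tts_text, out_wav)
```

Keeping the tagging in one place makes it easy to confirm that every one of the 433 texts really carries the `<|zh|>` prefix before counting how many outputs come back in Cantonese.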

@aluminumbox
Collaborator

Well, this is a drawback of BPE tokenization. Zero-shot/cross-lingual mode is not very stable here because Mandarin Chinese and Cantonese share the same characters.

@liujiaqi7998
Author

Thanks a lot.
Yes, that is exactly what I expected.
My guess is that the model was trained on the same written strings for both Mandarin and Cantonese.
My workaround is to check each output and, if the result is not what I expected, regenerate it with a new random seed.
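
A minimal sketch of that check-and-retry idea, under stated assumptions: `synthesize_with_retry`, `synthesize`, and `is_expected` are hypothetical names, and the thread does not say how the output is judged or how the seed is applied. Here both are passed in as callables so the loop itself stays independent of CosyVoice.

```python
import random


def synthesize_with_retry(synthesize, is_expected, max_tries=5, base_seed=0):
    """Call synthesize(seed) with fresh seeds until is_expected(result) passes.

    Returns (result, seed) on success, or (None, None) if every try failed.
    """
    rng = random.Random(base_seed)
    for _ in range(max_tries):
        seed = rng.randrange(2**31)
        result = synthesize(seed)
        if is_expected(result):
            return result, seed
    return None, None


# Stand-in for the real model: the first two tries come out as Cantonese.
calls = []


def fake_synthesize(seed):
    calls.append(seed)
    return "yue" if len(calls) < 3 else "zh"


result, seed = synthesize_with_retry(fake_synthesize, lambda lang: lang == "zh")
print(result, len(calls))  # prints: zh 3
```

In a real run, `synthesize` would presumably seed the generator (for example with `torch.manual_seed(seed)`, which is an assumption here) before calling `inference_cross_lingual`, and `is_expected` would be whatever Mandarin-versus-Cantonese check is available. Capping the number of tries matters because, at the reported failure rate of over 50%, some samples may need several attempts.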

@aluminumbox
Collaborator

nice trick

@Anmidy

Anmidy commented Sep 14, 2024

@liujiaqi7998
Hi, is the target text in your tts_text parameter Japanese text? Is person_voice_file.wav a Japanese audio file? Is this code meant to generate Chinese audio from Japanese text?

I want the opposite: to generate Japanese audio from Chinese text. My code is:

cosyvoice = CosyVoice('../../pretrained_models/CosyVoice-300M')
tts_text = "<|jp|>你好"
prompt_speech_22k = load_wav('../../cross_lingual_jp.wav', 22050)
for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_22k, stream=False)):
    torchaudio.save('cross_lingual_zh2jp.wav', j['tts_speech'], 22050)

cross_lingual_jp.wav is a Japanese audio file, but the generated cross_lingual_zh2jp.wav is still spoken in Chinese rather than the expected Japanese. What do I need to change?

@liujiaqi7998
Author

liujiaqi7998 commented Sep 14, 2024

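@Anmidy First, the model's output follows the input string, so you need to translate “你好” into “こんにちは” yourself. In theory, load_wav should load audio in the source language (I am not sure about this).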
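
Since the text has to be in the target language already, a quick sanity check before synthesis can catch the case above, where Chinese text is sent with a `<|jp|>` tag. The sketch below is an illustration only: `contains_kana` and `tag_japanese` are hypothetical helpers, not CosyVoice functions, and the check is a rough heuristic that would wrongly reject Japanese text written entirely in kanji.

```python
def contains_kana(text):
    """Return True if text has at least one hiragana or katakana character."""
    # Hiragana is U+3040..U+309F and katakana is U+30A0..U+30FF (contiguous).
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def tag_japanese(text):
    """Prefix text with the <|jp|> tag, refusing text that has no kana."""
    if not contains_kana(text):
        raise ValueError(f"text does not look Japanese: {text!r}")
    return "<|jp|>" + text


print(tag_japanese("こんにちは"))

try:
    tag_japanese("你好")
except ValueError as err:
    print("rejected:", err)
```

The tagged string would then be passed as `tts_text` to `inference_cross_lingual`, as in the snippets earlier in the thread.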

@Anmidy

Anmidy commented Sep 14, 2024

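@liujiaqi7998 So does that mean none of the three methods, inference_sft, inference_zero_shot, and inference_cross_lingual, can directly turn Chinese text into Japanese audio?
But this example in the README looks as if it turns English text into Chinese audio. Am I misunderstanding it?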

# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)
