Unigrams probs and add_unigrams_arpa.pl #4933

Open
FredSRichardson opened this issue Sep 6, 2024 · 1 comment

@FredSRichardson

It looks like the script:

https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/lang/add_unigrams_arpa.pl

doesn't make any attempt to ensure that the unigram probabilities sum to 1.0. I don't know whether this is a problem.

My suggestion would be to treat the "scale" parameter as the probability of OOV, P(OOV), as suggested in the script. Then the following normalizations could be done:

  1. Normalize non-OOV unigrams so they sum to 1 - P(OOV)
  2. Normalize OOV unigrams so they sum to P(OOV)

That should ensure that the set of specified OOV words is treated as having a combined probability of P(OOV) and the remainder of the lexicon picks up the remaining probability mass.
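
To make the two normalization steps concrete, here is a minimal Python sketch of the proposal. It is not what add_unigrams_arpa.pl currently does. The function name `renormalize_unigrams` and its arguments are hypothetical, and it assumes the existing unigrams are available as log10 probabilities (as stored in an ARPA file), that the new words come with linear-domain weights, and that the two word sets are disjoint.

```python
import math


def renormalize_unigrams(in_vocab_log10, oov_weights, p_oov):
    """Return log10 unigram probabilities that sum to 1.0 in the linear domain.

    in_vocab_log10: dict word -> log10 probability of existing unigrams.
    oov_weights:    dict word -> positive linear-domain weight of new words.
    p_oov:          total probability mass to give to the new words.
    All names here are illustrative; none come from the Kaldi script.
    """
    if not 0.0 < p_oov < 1.0:
        raise ValueError("p_oov must lie strictly between 0 and 1")
    if set(in_vocab_log10) & set(oov_weights):
        raise ValueError("this sketch assumes the two word sets are disjoint")

    iv_total = sum(10.0 ** lp for lp in in_vocab_log10.values())
    oov_total = sum(oov_weights.values())
    if iv_total <= 0.0 or oov_total <= 0.0:
        raise ValueError("both word sets need positive total mass")

    result = {}
    # Step 1: in-vocabulary unigrams are rescaled to sum to 1 - P(OOV).
    for word, lp in in_vocab_log10.items():
        result[word] = math.log10((10.0 ** lp) / iv_total * (1.0 - p_oov))
    # Step 2: new (OOV) unigrams are rescaled to sum to P(OOV).
    for word, weight in oov_weights.items():
        result[word] = math.log10(weight / oov_total * p_oov)
    return result
```

Note that this only touches the unigram probabilities. It does not recompute backoff weights, so the higher-order distributions would in general not be exactly normalized afterwards.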

It may also make sense to ensure that any user-specified word that already exists in the lexicon is moved to the OOV set, so that it takes on the probability specified by the user. I actually don't know if that's a good idea, as it will impact all backoff N-grams. Perhaps it is better to emit a warning and skip these words, or to provide an option that applies the user-specified probabilities to in-vocabulary words if that's really what the user wants to do.
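
The warn-and-skip behaviour with an opt-in override could look roughly like the Python sketch below. The function `select_user_words` and the `allow_override` flag are hypothetical names used only for illustration; nothing like them exists in the script today.

```python
import warnings


def select_user_words(vocab, user_probs, allow_override=False):
    """Split user-specified words into those to apply and those to skip.

    vocab:          set of words already present in the LM.
    user_probs:     dict word -> user-specified probability.
    allow_override: if True, in-vocabulary words keep the user's probability
                    instead of being skipped (hypothetical option).
    """
    accepted = {}
    skipped = []
    for word, prob in user_probs.items():
        if word in vocab and not allow_override:
            # Default: leave existing unigrams alone and tell the user.
            warnings.warn(f"'{word}' is already in the LM vocabulary; skipping")
            skipped.append(word)
        else:
            accepted[word] = prob
    return accepted, skipped
```

With the default setting, existing unigrams (and therefore the backoff N-grams that depend on them) are left unchanged. Overloading in-vocabulary words becomes an explicit choice.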

@judyfong

I used it in the ASR for Althingi to actually overload the unigrams, so I saw it as a feature, not a bug.
