Pre-filter words whose diacritic forms are not in the dictionary #15

Open
ruohoruotsi opened this issue Jul 29, 2019 · 1 comment

Pre-filter words whose non-diacritized word-forms are not in the dictionary before asking the model to do ADR (automatic diacritic restoration). This way we get more predictable results and error messages for Out-Of-Vocabulary (OOV) words.

If the model sees a word like elerindodo, validate that this word's diacritic form exists in the dictionary and return an error message if it doesn't! Since the model doesn't know about elerindodo, it can just say so, rather than confusing users by returning the "top probability word", which may be something unrelated like aláǹtakùn!
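A minimal sketch of such a pre-filter, assuming the dictionary is available as a list of diacritized words and that the model input carries no diacritics at all (neither tone marks nor under-dots). The names `strip_diacritics`, `build_plain_vocab`, and `find_oov` are hypothetical and not part of this repository:

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with all combining marks (tones, under-dots) removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_plain_vocab(diacritized_words):
    """Map a diacritized dictionary to the set of its plain, lowercase forms."""
    return {strip_diacritics(w).lower() for w in diacritized_words}


def find_oov(text: str, plain_vocab) -> list:
    """Return the whitespace-separated tokens whose plain form is not in the vocab."""
    return [
        tok for tok in text.split()
        if strip_diacritics(tok).lower() not in plain_vocab
    ]


# Assumed toy dictionary; a real one would be loaded from the training data.
vocab = build_plain_vocab(["aláǹtakùn", "ọjọ́"])
oov = find_oov("alantakun elerindodo", vocab)
if oov:
    print(f"Out-of-vocabulary words, skipping ADR: {oov}")
```

The check runs before the model is called, so an unknown word produces an explicit message instead of an arbitrary prediction. Tokenization here is a plain whitespace split; punctuation handling is left out.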

@ruohoruotsi ruohoruotsi added the bug Something isn't working label Jul 29, 2019
@ruohoruotsi ruohoruotsi self-assigned this Jul 29, 2019

ruohoruotsi commented Jul 29, 2019

@Olamyy nicely points out that

a word2vec/sentence2vec model might work really well here. For every entry (word/sentence) a user inputs, try to find the word in the model vocabulary. If it doesn't exist, either raise an error or get the closest word in the vocab. I suppose fasttext would work well here since it uses subword (ngram) sets.
The challenge here might just be the extra step.
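The "closest word in the vocab" fallback could be sketched as follows. This is not fastText: it is a plain character n-gram Jaccard overlap used as a stand-in for the subword idea, and `char_ngrams`, `jaccard`, `closest_in_vocab`, and the `min_score` threshold are all hypothetical choices made for illustration:

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of the word, padded with < and > boundary markers."""
    padded = f"<{word}>"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_in_vocab(word: str, vocab, min_score: float = 0.3):
    """Return the word itself if known, else the most similar vocab entry,
    or None when nothing reaches min_score (the caller can then raise an error)."""
    if word in vocab:
        return word
    query = char_ngrams(word)
    best, best_score = None, 0.0
    for candidate in sorted(vocab):  # sorted for deterministic tie-breaking
        score = jaccard(query, char_ngrams(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_score else None


plain_vocab = {"alantakun", "ile", "omo"}  # assumed toy vocabulary
print(closest_in_vocab("alantakum", plain_vocab))
print(closest_in_vocab("xyz", plain_vocab))
```

Returning `None` below the threshold keeps both options from the comment open: the caller can report an OOV error or substitute the suggested word.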
