Jyutping Improvement #4
Comments
Let me know if you need any help. I'm currently developing this project: https://github.com/hockyy/miteiru
@graphemecluster As far as I know, the words.hk (粵典) data has been in use from early on? Does the current update mainly use Jon's font data?
Right now it only uses Jon's data, but it is still definitely more accurate than jieba word segmentation.
@hockyy The accuracy should be above 99% since our latest updates (JS/TS version 2.0.0 / Python version 0.3.0) a few days ago.
Ack, OK, thank you for the info.
I'll debug it tomorrow, I'm so sleepy 😪
I don't know how you collect your Jyutping data. These lists are available from words.hk:
https://words.hk/faiman/analysis/wordslist.json
https://words.hk/faiman/analysis/charlist.json
If you haven't included this source yet, I think it's worth a try. I'm too lazy to write a new library, so I will use your to-jyutping. If you want to update the dictionary, you can parse all the words from those files. For the tokenizer, we can use jieba:
https://github.com/hockyy/jieba-cantonese
I've made a script that auto-generates a jieba user dictionary for tokenizing, so looking up Jyutping per token can give better results. If a token has no entry, fall back to per-character Jyutping.
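The per-token lookup with a per-character fallback could be sketched roughly like this. It is only an illustration, not how to-jyutping or jieba-cantonese actually work: the two small dictionaries and the function names are hypothetical stand-ins for data parsed from a word list and a character list, and the tokens are assumed to come from a tokenizer that has already run.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sample data. Real entries would be parsed from a word list
# and a character list; the format used here is an assumption.
WORD_JYUTPING: Dict[str, str] = {"香港": "hoeng1 gong2"}
CHAR_JYUTPING: Dict[str, str] = {"香": "hoeng1", "港": "gong2", "人": "jan4"}


def token_to_jyutping(token: str) -> List[Optional[str]]:
    """Return one reading per character of `token` (None if unknown)."""
    word_reading = WORD_JYUTPING.get(token)
    if word_reading is not None:
        syllables = word_reading.split()
        # Only trust the word entry if it has one syllable per character.
        if len(syllables) == len(token):
            return list(syllables)
    # Fallback: the token is not in the word list, so look up each character.
    return [CHAR_JYUTPING.get(ch) for ch in token]


def tokens_to_jyutping(tokens: List[str]) -> List[Tuple[str, List[Optional[str]]]]:
    """Pair every token with its per-character readings."""
    return [(token, token_to_jyutping(token)) for token in tokens]


if __name__ == "__main__":
    # Tokens are assumed to be produced by a tokenizer beforehand.
    for token, readings in tokens_to_jyutping(["香港", "港人"]):
        print(token, readings)
```

Word-level entries are tried first because they can carry context-dependent readings; the character table is used only when a token is missing from the word list.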