Add tokenisation options for more languages #342

Open
stijn-uva opened this issue Mar 15, 2023 · 1 comment
Labels
enhancement (New feature or request), processors (Involves self-contained analytical processors)

Comments

@stijn-uva
Member

stijn-uva commented Mar 15, 2023

Some languages, in particular East-Asian ones, don't (just) use spaces to separate words, so the standard NLTK tokeniser doesn't work for them. This is probably an issue for many languages, but the East-Asian ones are likely the most pressing because they are used by a large number of people online.

Support for Chinese tokenisation has been added using jieba. There are other languages to consider; here is a nice overview. However, the libraries listed there all have dependencies that make them difficult to install, so more work is needed to figure out how best to install them alongside 4CAT.
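To make the problem and one possible direction concrete, here is a minimal sketch, not 4CAT's actual implementation. It assumes a hypothetical per-language registry (`TOKENISERS`), hypothetical language codes (`"zh"`, `"en"`), and uses `str.split` as a stand-in for a whitespace-based tokeniser. The only third-party call is `jieba.lcut`, which is imported lazily so that the optional dependency does not have to be installed for other languages to work.

```python
from typing import Callable, Dict, List, Tuple

Tokeniser = Callable[[str], List[str]]


def whitespace_tokenise(text: str) -> List[str]:
    """Stand-in for a space-based tokeniser: split on runs of whitespace."""
    return text.split()


def character_tokenise(text: str) -> List[str]:
    """Crude fallback for unsegmented scripts: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def jieba_tokenise(text: str) -> List[str]:
    """Segment Chinese text with jieba, dropping whitespace-only tokens."""
    import jieba  # lazy import: raises ImportError if jieba is not installed

    return [token for token in jieba.lcut(text) if token.strip()]


# Hypothetical registry: language code -> (preferred tokeniser, fallback).
TOKENISERS: Dict[str, Tuple[Tokeniser, Tokeniser]] = {
    "zh": (jieba_tokenise, character_tokenise),
}


def tokenise(text: str, language: str) -> List[str]:
    """Use the language's preferred tokeniser; fall back if its library is missing."""
    preferred, fallback = TOKENISERS.get(
        language, (whitespace_tokenise, whitespace_tokenise)
    )
    try:
        return preferred(text)
    except ImportError:
        return fallback(text)


if __name__ == "__main__":
    sentence = "我爱自然语言处理"
    # Splitting on spaces leaves the whole sentence as a single "word".
    print(sentence.split())
    # With jieba installed this prints word segments; otherwise single characters.
    print(tokenise(sentence, "zh"))
```

A registry like this would let additional languages be added one at a time, each with its own optional dependency, while languages without an entry keep the existing space-based behaviour.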

stijn-uva added the enhancement (New feature or request) and processors (Involves self-contained analytical processors) labels Mar 15, 2023
stijn-uva added this to the 1.40 (Summer School 2023) milestone Mar 15, 2023
@oxygala

oxygala commented Mar 28, 2023

I am not sure if I have to open a new issue for this, but tokenisation for Turkish would be fantastic too.
Here's a project that might help: https://github.com/apdullahyayik/TrTokenizer (there's also a stemmer: https://github.com/otuncelli/turkish-stemmer-python)
