What is the best existing tokenizer for processing Korean?
I have tried CJKTokenizer in Solr 4.0. It does tokenize the text, but the accuracy is very low.
POSTECH/K is a Korean morphological analyzer that can tokenize and POS-tag Korean data without much effort. The software reports 90.7% on the corpus it was trained and tested on (see http://nlp.postech.ac.kr/download/postag_k/9908_cljournal_gblee.pdf).
The POS tagging achieved 81% on the Korean data of a multilingual corpus project I’ve been working on.
However, there’s a catch: you have to use Windows to run the software. But I have a script to bypass that limitation; here’s the script:
Note that the encoding for POSTECH/K is EUC-KR, so if your data is in UTF-8, you can use the following script to convert from UTF-8 to EUC-KR.
(Source for sejong-shell: Liling Tan. 2011. Building the foundation text for Nanyang Technological University – Multilingual Corpus (NTU-MC). Final year project. Singapore: Nanyang Technological University. pp. 44.)
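For the UTF-8 to EUC-KR step, a minimal Python sketch could look like the following. This is not the script from the cited project; the function name `utf8_to_euckr` and the file names in the usage note are hypothetical, and it assumes the input is valid UTF-8. Characters that EUC-KR cannot represent raise an error by default, so they are not silently lost.

```python
def utf8_to_euckr(src_path, dst_path, errors="strict"):
    """Re-encode a UTF-8 text file as EUC-KR (hypothetical helper).

    errors="strict" raises UnicodeEncodeError on characters that have no
    EUC-KR representation; pass errors="replace" to write "?" instead.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="euc-kr", errors=errors, newline="") as dst:
        for line in src:
            dst.write(line)
```

Usage would be along the lines of `utf8_to_euckr("input_utf8.txt", "input_euckr.txt")`, after which the converted file can be passed to the analyzer.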