I am trying to segment a Korean string into individual syllables.
So the input would be a string like “서울특별시” and the outcome “서”, “울”, “특”, “별”, “시”.
I have tried to segment the string in both C++ and Python, but the result is a series of ? characters or white spaces, respectively (the string itself, however, prints correctly on the screen).
In C++ I first initialized the input string as string korean="서울특별시" and then used a string::iterator to go through the string and print each individual component.
In Python I have just used a simple for loop.
I am wondering if there is a solution to this problem. Thanks.
I don’t know Korean at all, and can’t comment on the division into syllables, but in Python 2 the following works:
Output:
In Python 3 you don’t need the u prefix for Unicode strings.

The outputs are the Unicode values of the characters in the string, which means that the string has been correctly cut up in this case. The reason I printed them with repr is that the font in the terminal I used can’t represent them, so without repr I just see square boxes. But that’s purely a rendering issue; repr demonstrates that the data is correct.

So, if you know logically how to identify the syllables, then you can use repr to see what your code has actually done. Unicode NFC sounds like a good candidate for actually identifying them (thanks to R. Martinho Fernandes), and unicodedata.normalize() is the way to get that.
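For what it's worth, here is a minimal Python 3 sketch of the NFC idea. It assumes that one code point per syllable after normalization is what you want; the helper name split_syllables is made up for illustration, and I haven't checked it against archaic Hangul or text mixed with other combining marks.

```python
import unicodedata


def split_syllables(text):
    """Return the code points of text after NFC normalization.

    Hypothetical helper: NFC composes conjoining jamo sequences into
    precomposed Hangul syllables where a precomposed form exists, so each
    modern syllable should end up as a single code point.
    """
    return list(unicodedata.normalize("NFC", text))


city = "서울특별시"
syllables = split_syllables(city)
print(syllables)

# ascii() plays the role repr played in Python 2: it shows the escapes,
# so the data can be checked even if the terminal font lacks the glyphs.
print([ascii(ch) for ch in syllables])

# The same text in decomposed form (NFD) has more code points than
# syllables, but it normalizes back to the same five items.
decomposed = unicodedata.normalize("NFD", city)
print(len(decomposed), len(split_syllables(decomposed)))

# Each of these syllables takes three bytes in UTF-8, which is why stepping
# through the encoded bytes one at a time prints garbage.
utf8_length = len(city.encode("utf-8"))
print(utf8_length)
```

If the pieces come out as more than one item per syllable, comparing the ascii() output before and after normalization should show whether the input was stored as separate jamo.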