I’m thinking of writing a language identification program in C. I have already searched the internet and found the N-Gram-Based Text Categorization article, and I have also created my own set of utilities to handle some of my programming needs. First, I would like to try creating a simple program that uses printf to print a Japanese word written in hiragana, katakana, and kanji. I believe this can be done in C, but I’m not sure how to implement it; maybe this is related to Unicode programming. Can anyone explain what I need to learn first, which libraries I need to #include, or which utilities I can use as a basis for implementing this program?
I don’t think C is the best choice for this project. IMO you should look into a higher-level language (like C#) that has phenomenal built-in support. Just a quick example:
C#:
Boom. Done.
Now in C, to the best of my knowledge, there are no simple standard encoding/decoding libraries or utilities. You’ll have to write this stuff by hand. I started doing that at one point myself, but realized it was a waste of my time. 🙂
If you insist on C, I would suggest you start by reading everything you can about the different types of encodings (multibyte and wide-character encodings). There are lots of good tutorials on Unicode around the web to get you started (here’s a good one I used).
EDIT: OK, if no C#, then let’s take a “short” example in C… again, this assumes you know something about encoding (note the use of the wide char: wchar_t):
That’s Chinese… I think it’s the same Kanji, but I’m not great with Japanese…
That is how you can print. Storing works similarly: you store the text in a wchar_t array, then do your comparisons.