I’m thinking of doing a language identification program using C language. I already searched


Asked: June 11, 2026 (2026-06-11T07:03:43+00:00)


I’m thinking of writing a language identification program in C. I searched the internet and found the article “N-Gram-Based Text Categorization”, and I have also written my own set of utilities to handle some of my programming needs. As a first step, I would like to write a simple program that uses printf to print a Japanese word written in hiragana, katakana, and kanji. I believe this can be done in C, but I’m not sure how to implement it; it is probably related to Unicode programming. Can anyone explain what I need to learn first, which libraries I need to #include, or which utilities I can use as a basis for implementing this program?


1 Answer

Editorial Team · Answer 1 · 2026-06-11T07:03:44+00:00

I don’t think C is the best choice for this project. IMO you should look into a higher-level language (like C#) that has strong built-in support for text encodings. Just a quick example:

C#:

using System.Text; // provides Encoding

// UTF-8 bytes for two CJK characters (U+80B2, U+513F)
byte[] buffer = new byte[] { 0xE8, 0x82, 0xB2, 0xE5, 0x84, 0xBF };
string s = Encoding.UTF8.GetString(buffer);

Boom. Done.

Now in C, to the best of my knowledge, there are no comparably simple standard encoding/decoding libraries or utilities, apart from the locale-dependent conversion functions such as mbstowcs and wcstombs. You’ll largely have to create this stuff by hand. I started doing that myself at one point, but decided it was a waste of my time. 🙂

If you insist on C, I would suggest you start by reading everything you can about the different kinds of encodings (multibyte and wide-character encodings). There are lots of good tutorials on Unicode around the web to get you started (here’s a good one I used).
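
To make the "by hand" part concrete, here is a minimal sketch of decoding a single UTF-8 sequence into a Unicode code point in C. The function name utf8_decode is hypothetical (it is not a standard library function), and the sketch is simplified: it checks the byte structure but does not reject overlong encodings or surrogate code points, which a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the C standard library.
   Decodes one UTF-8 sequence starting at s, with at most n bytes available.
   On success stores the code point in *out and returns the number of bytes
   consumed (1 to 4). Returns 0 on truncated or malformed input.
   Simplification: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    unsigned char lead = s[0];
    size_t len;
    uint32_t cp;

    if (lead < 0x80)                { len = 1; cp = lead; }        /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; } /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; } /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; } /* 11110xxx */
    else
        return 0; /* stray continuation byte or invalid lead byte */

    if (len > n)
        return 0; /* sequence is cut off */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0; /* continuation bytes must look like 10xxxxxx */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

Called twice on the six bytes from the C# snippet (0xE8 0x82 0xB2 0xE5 0x84 0xBF), it consumes three bytes each time and yields U+80B2 followed by U+513F.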

EDIT: OK, if no C#, then let’s take a “short” example in C… again, this assumes you know something about encoding (note the use of the wide char: wchar_t):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
  /* U+6C34 (decimal 27700), the character for "water" */
  wchar_t water = 0x6C34;

  /* Adopt the environment's locale (e.g. a UTF-8 one) so that
     wide characters can be converted for output. */
  setlocale(LC_ALL, "");

  printf("%lc\n", (wint_t)water);
  return 0;
}

mike@linux-4puc:~> ./a.out 
水

That’s Chinese… I think it’s the same kanji, but I’m not great with Japanese…
That is how you can print. Storing works similarly: you store each character in a wchar_t, then do your comparisons.
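
As a rough illustration of "store in a wchar_t, then compare", the sketch below classifies a wide character as hiragana, katakana, or kanji by its code point. It assumes wchar_t holds Unicode code points (true with glibc on Linux, not guaranteed everywhere), and it only covers the basic blocks: Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, and CJK Unified Ideographs U+4E00 to U+9FFF. The names jp_script, classify_char, and count_scripts are made up for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Illustrative names only; none of this is a standard API. */
enum jp_script { JP_OTHER, JP_HIRAGANA, JP_KATAKANA, JP_KANJI, JP_SCRIPT_COUNT };

/* Assumes wchar_t values are Unicode code points. */
enum jp_script classify_char(wchar_t c)
{
    unsigned long cp = (unsigned long)c;

    if (cp >= 0x3040UL && cp <= 0x309FUL) return JP_HIRAGANA; /* Hiragana block */
    if (cp >= 0x30A0UL && cp <= 0x30FFUL) return JP_KATAKANA; /* Katakana block */
    if (cp >= 0x4E00UL && cp <= 0x9FFFUL) return JP_KANJI;    /* CJK Unified Ideographs */
    return JP_OTHER;
}

/* Tallies how many characters of each script appear in a
   null-terminated wide string; counts is indexed by enum jp_script. */
void count_scripts(const wchar_t *s, size_t counts[JP_SCRIPT_COUNT])
{
    for (int i = 0; i < JP_SCRIPT_COUNT; i++)
        counts[i] = 0;
    for (; *s != L'\0'; s++)
        counts[classify_char(*s)]++;
}
```

Note that the kanji range is shared with Chinese, so counts like these can only hint at Japanese (kana present) versus Chinese (ideographs only); they would be one input to a language identifier, not a complete one.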
