I am trying to get corpus for a certain language. But when I get

Asked: May 26, 2026 (2026-05-26T15:23:14+00:00)

I am trying to build a corpus for a certain language. When I fetch a webpage, how can I determine which language it is written in?
Chrome can do it, but what is the principle behind it?

I can come up with some ad hoc methods, such as an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?

1 Answer

Editorial Team · Answer 1 · 2026-05-26T15:23:14+00:00

I suppose the common method is to look at things like letter frequencies, common letter sequences and words, and character sets (as you describe). There are lots of different ways. An easy one would be to get a set of dictionary files for various languages, test which one gets the most hits from the page, and then offer, say, the next three as alternatives.
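The dictionary idea could look roughly like the sketch below. It is only an illustration: the tiny `WORDLISTS` table stands in for real dictionary files (an assumption, you would load full word lists per language), and the function names `rank_languages` and `guess_language` are made up for this example. It counts how many words of the page text appear in each language's list, picks the top scorer, and returns the next three as alternatives.

```python
import re

# Hypothetical stand-in for real dictionary files: a few very common
# words per language. A real setup would load full word lists instead.
WORDLISTS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "pour"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"},
    "es": {"el", "la", "los", "y", "de", "es", "un", "una", "para", "con"},
}


def rank_languages(text, wordlists=WORDLISTS):
    """Return (language, hit_count) pairs, best match first."""
    # Split into lowercase runs of letters (Unicode-aware, no digits/underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in wordlists.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def guess_language(text, n_alternatives=3):
    """Return (best_language, alternatives); best is None if nothing matched."""
    ranked = rank_languages(text)
    if not ranked or ranked[0][1] == 0:
        return None, []
    best = ranked[0][0]
    alternatives = [lang for lang, _ in ranked[1:1 + n_alternatives]]
    return best, alternatives


if __name__ == "__main__":
    page_text = "Le chat est dans la maison et il est pour une amie."
    print(guess_language(page_text))
```

Note that this expects plain text, so HTML markup should be stripped from the page first. Short pages and closely related languages will give few or ambiguous hits, which is where the letter-frequency and letter-sequence approaches mentioned above tend to help.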
