I am trying to build a corpus for a certain language. But when I fetch a webpage, how can I determine what language it is written in?
Chrome can do it, but what’s the principle?
I can come up with some ad-hoc methods, like an educated guess based on character sets, IP address, HTML tags, etc. But is there a more formal method?
I suppose the common method is to look at things like letter frequencies, common letter sequences and words, and character sets (as you describe)… there are lots of different ways. An easy one would be to get a bunch of dictionary files for various languages, test which one gets the most hits from the page, and then offer, say, the next three as alternatives.
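To make the dictionary idea concrete, here is a minimal sketch in Python. It assumes the page text has already been extracted from the HTML, and the tiny `WORDLISTS` table is a made-up stand-in for real dictionary files; the function names are hypothetical too. It only illustrates the hit-counting approach, and is not how Chrome or any particular library does it.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists for illustration only.
# In practice you would load full dictionary files, one per language.
WORDLISTS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une"},
    "es": {"el", "la", "y", "los", "de", "es", "un", "que"},
}


def rank_languages(text, wordlists=WORDLISTS):
    """Return (language, hit_count) pairs, most hits first."""
    # Split into runs of letters (Unicode-aware), ignoring digits and underscores.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter({lang: 0 for lang in wordlists})
    for token in tokens:
        for lang, words in wordlists.items():
            if token in words:
                hits[lang] += 1
    return hits.most_common()


def guess_language(text, alternatives=3, wordlists=WORDLISTS):
    """Return the best guess plus the next few candidates as alternatives."""
    ranked = rank_languages(text, wordlists)
    best = ranked[0][0]
    others = [lang for lang, _ in ranked[1:1 + alternatives]]
    return best, others


if __name__ == "__main__":
    print(guess_language("The cat is in the house and it is happy"))
```

With word lists this small the counts are very noisy, so short pages or pages in closely related languages can easily be misjudged. Larger dictionaries, or normalizing the hit count by the number of tokens, should help.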