I’m working on crawling pages for information and have run into many problems parsing the pages in Groovy. I’ve made a semi-solution that works most of the time, using juniversalchardet and scanning the page for the charset <meta> tag in the head, but sometimes two of these tags are found on one page, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Is there a standard for which one to use (first, last, both?), or some easier way to do this? Thanks.
I would do it heuristically:
You might want to look at the content-type header coming back from the web server, too…
Fundamentally the page is broken, but the above should give a reasonable “best guess.”
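To make the "best guess" idea concrete, here is a minimal sketch in plain Java (which can be called from Groovy). The ordering is my assumption, not something stated above: trust the charset in the HTTP Content-Type header first, then the first <meta> declaration that names a charset the JVM supports, then a caller-supplied fallback. As far as I know, that is broadly the order browsers use under the HTML Standard's encoding sniffing, leaving byte order marks aside. The class and method names (`CharsetGuess`, `charsetIn`, `metaCharsets`, `guess`) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks a charset from the HTTP header and <meta> tags.
final class CharsetGuess {
    // Matches charset=... in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag (regex scan, not a real HTML parser).
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the lower-cased charset named in s, or null if none is named
     *  or the JVM does not support it. */
    public static String charsetIn(String s) {
        if (s == null) return null;
        Matcher m = CHARSET.matcher(s);
        if (!m.find()) return null;
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name.toLowerCase(Locale.ROOT) : null;
        } catch (IllegalArgumentException e) {
            return null; // syntactically illegal charset name
        }
    }

    /** All usable charsets declared in <meta> tags, in document order. */
    public static List<String> metaCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            String cs = charsetIn(m.group());
            if (cs != null) found.add(cs);
        }
        return found;
    }

    /** Assumed order: HTTP header, then first usable <meta>, then fallback. */
    public static String guess(String contentTypeHeader, String html, String fallback) {
        String fromHeader = charsetIn(contentTypeHeader);
        if (fromHeader != null) return fromHeader;
        List<String> declared = metaCharsets(html);
        return declared.isEmpty() ? fallback : declared.get(0);
    }
}
```

Since the <meta> tags are plain ASCII, you can decode the raw bytes as ISO-8859-1 just for this scan and then re-decode the page with whatever `guess` returns. The fallback argument is where the result of a detector such as juniversalchardet could go. If `metaCharsets` returns conflicting values, as in your example, it is probably worth logging so you can spot pages where the guess may be wrong.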