I’m working on crawling pages for information, and have run into many problems with

Asked: May 12, 2026

I’m working on crawling pages for information, and have run into many problems with parsing the pages in Groovy. I’ve made a semi-solution that works most of the time, using juniversalchardet and just scanning the page for the charset <meta> tag in the head, but sometimes two of these tags are found on one page, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Is there a standard on which one to use (first, last, both..?) or some easier way to do this? Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-05-12T07:59:56+00:00


I would do it heuristically:

Is everything actually ASCII? If so, it doesn’t matter which you use.
Does it conform to valid UTF-8? If so, I’d use that.
Otherwise, use ISO-8859-1.
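
The three checks above can be sketched in plain Java, which is callable from Groovy as-is. This is only a minimal illustration: the class name `CharsetGuess` and its method names are hypothetical, and it assumes the whole page is already available as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the heuristic; not part of any library.
final class CharsetGuess {

    // True if every byte is in the 7-bit ASCII range (0x00-0x7F).
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {          // bytes >= 0x80 are negative in Java
                return false;
            }
        }
        return true;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ASCII first, then UTF-8, then fall back to ISO-8859-1.
    static Charset guess(byte[] bytes) {
        if (isAscii(bytes)) {
            // Pure ASCII decodes identically under all three charsets.
            return StandardCharsets.US_ASCII;
        }
        if (isValidUtf8(bytes)) {
            return StandardCharsets.UTF_8;
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

From Groovy, something like `new String(bytes, CharsetGuess.guess(bytes))` would then decode the page with the guessed charset.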

You might want to look at the Content-Type header coming back from the web server, too…
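
As a rough sketch of reading the charset from that header, the Java snippet below pulls the `charset` parameter out of a Content-Type value such as `text/html; charset=utf-8`. The class name `HeaderCharset` and the regular expression are assumptions made for illustration, not a full header parser; the same function could also be applied to the `content` attribute of a meta tag.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is a simplification of real header syntax.
final class HeaderCharset {

    // Matches charset=value, with optional quotes around the value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or empty if none is declared or the
    // declared name is not a charset this JVM knows.
    static Optional<Charset> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

A declared charset is still only a hint: if the bytes do not actually decode under it, falling back to the byte-level checks is a reasonable option.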

Fundamentally the page is broken, but the above should give a reasonable “best guess.”
