I have a collection of HTML documents for which I need to parse the

Asked: May 10, 2026

I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I’m interested in, i.e. I don’t need to parse anything in the <body> section.

I’ve attempted to parse these values using the XPath support provided by JDOM. However, this isn’t working well because a lot of the HTML in the <body> section is not valid XML.

Does anyone have any suggestions for how I might parse these tag values in a manner that can deal with malformed HTML?

Cheers, Don


1 Answer

score 0 · Answer 1 · 2026-05-10T21:25:17+00:00

Added an answer on May 10, 2026 at 9:25 pm

You can likely use the Jericho HTML Parser. In particular, have a look at this to see how you can go about finding specific tags.
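To illustrate the general idea of a lenient, callback-style parse that only cares about <meta> tags, here is a minimal sketch. Note that it does not use Jericho: so that it runs without extra dependencies, it swaps in the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`). The class name `MetaTagExtractor` and the method `extractMeta` are made up for this example, and keying each entry by `name` (falling back to `http-equiv`) is an assumption about what you want to collect.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Jericho or JDOM.
final class MetaTagExtractor {

    // Returns name -> content for every <meta> tag the parser reports.
    // Tags with neither a name nor an http-equiv attribute are skipped.
    static Map<String, String> extractMeta(String html) {
        Map<String, String> meta = new LinkedHashMap<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META) {
                    return; // ignore every other empty element
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                if (name == null) {
                    name = attrs.getAttribute(HTML.Attribute.HTTPEQUIV);
                }
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null) {
                    meta.put(name.toString(), content.toString());
                }
            }
        };

        try {
            // ignoreCharSet = true: do not abort when a <meta> tag declares a charset.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return meta;
    }
}
```

Calling `MetaTagExtractor.extractMeta(html)` on a page's source returns the collected pairs in document order. The bundled parser targets an old HTML dialect, so a dedicated library such as Jericho is probably the better choice for real-world pages; the sketch only shows the shape of the approach.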
