The Archive Base Latest Questions

Editorial Team
Asked: May 26, 2026 (2026-05-26T19:18:01+00:00)

This is the beginning: I have a file on disk which is an HTML page. When I open it in a regular web browser, it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.

Then I come in: my task is to load the same file, parse it, and print some pieces to the console, say the text of all <hX> elements. Of course I would like to see only correct characters, not mumbo-jumbo. The last step is to change some of the text and save the file.

So the parser has to handle encoding in both directions, reading and writing. So far I am not aware of a parser that can even load the data correctly.

Question

What parser would you recommend?

Details

An HTML page generally declares its encoding in the head (in a meta tag), so the parser should use it. A scenario where I have to look at the file in advance, check the encoding, and then set it manually in code is a no-go. For example, this is taken from the JSoup tutorials:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

I cannot do such a thing; the parser has to handle encoding detection by itself.

In C# I faced a similar problem with loading HTML. I used HtmlAgilityPack: I first ran encoding detection, then used the detected encoding to decode the data stream, and after that I parsed the data. So I did both steps explicitly, but since the library provides both methods, that is fine with me.

Such explicit separation might even be better, because it would allow a probabilistic encoding detection method to be used when the header is missing.

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 7:18 pm (2026-05-26T19:18:02+00:00)

    The Jsoup API reference says of that parse method that if you pass null as the second argument (the charset name), it will use the http-equiv meta tag to determine the encoding. So it looks like it already does the “parse a bit, determine the encoding, re-parse with the proper encoding” routine. Normally such parsers should be able to resolve the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try to establish an encoding.
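The remark about XML parsers can be tried out with the SAX parser that ships with the JDK. This is only a sketch: the class name `XmlDeclarationDemo`, its helper methods, and the sample document are invented for illustration, and it assumes the input is well-formed XML. The bytes are handed over as a raw stream with no charset given, so the parser has to rely on the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
final class XmlDeclarationDemo {

    /** A tiny XHTML-like document stored on "disk" as ISO-8859-1 bytes. */
    static byte[] sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><h1>caf\u00e9</h1>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Parses raw bytes and returns all character data found in the document. */
    static String textOf(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        try {
            // No charset is passed in: the parser reads it from the declaration.
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xmlBytes),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Compares against the expected text via a Unicode escape,
        // so the result does not depend on the console encoding.
        System.out.println(textOf(sample()).equals("caf\u00e9"));
    }
}
```

Real-world HTML is often not well-formed XML, in which case this parser throws instead of guessing.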

    Apparently Jsoup will default to UTF-8 if no proper meta tag is found. As they say in the documentation, this is “usually safe” since UTF-8 is compatible with a host of common encodings for the lower code points (the ASCII range). But I take it that “usually safe” might not really be good enough in this case.
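The compatibility claim, and its limit, can be checked with the JDK alone. A minimal sketch; the class name `AsciiOverlapDemo` and the sample strings are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
final class AsciiOverlapDemo {

    /** Returns true if the text encodes to identical bytes in both charsets. */
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    /** Simulates reading a file with the wrong charset: encode with one, decode with another. */
    static String misread(String text, Charset storedAs, Charset readAs) {
        return new String(text.getBytes(storedAs), readAs);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Pure ASCII markup is byte-for-byte identical in both encodings.
        System.out.println(sameBytes("<h1>Title</h1>", latin1, utf8));

        // A national character (e with acute accent) is one byte in
        // ISO-8859-1 but two bytes in UTF-8, so the byte sequences differ.
        System.out.println(sameBytes("caf\u00e9", latin1, utf8));

        // Reading ISO-8859-1 bytes as UTF-8 does not give the text back;
        // the malformed byte is replaced during decoding.
        System.out.println(misread("caf\u00e9", latin1, utf8).equals("caf\u00e9"));
    }
}
```

So a UTF-8 default is harmless for tags and attribute names, and it is exactly the national characters that get damaged when the guess is wrong.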

    If you don’t sufficiently trust Jsoup to detect the encoding, I see two alternatives:

    • If you can somehow be certain that the HTML is always in fact XHTML, an XML parser might be a better fit. But that only works if the input is definitely XML-compliant.
    • Do a heuristic encoding detection yourself: check for byte-order marks, parse a portion using common encodings to find a meta tag, detect the encoding from byte patterns you would expect in header tags, and finally, if all else fails, use a default.
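The second alternative could be sketched roughly as follows, using only the JDK. The class `CharsetSniffer`, its `sniff` method, the 4096-byte scan window, and the regular expression are all assumptions made for this illustration, not anything Jsoup provides. The sketch covers the byte-order mark, the meta tag, and the default; it skips the byte-pattern step and does not handle UTF-32 or UTF-16 without a byte-order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CharsetSniffer {

    // Finds "charset=..." inside a <meta> tag. Covers <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Guesses the charset of raw HTML bytes, returning fallback if nothing is found. */
    static Charset sniff(byte[] data, Charset fallback) {
        // Step 1: byte-order marks.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: decode the start of the file as ISO-8859-1, which maps every
        // byte to a character, and look for a meta tag in the ASCII markup.
        int length = Math.min(data.length, 4096);
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }

        // Step 3: all else failing, use the default.
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page, StandardCharsets.UTF_8));
    }
}
```

Once a charset has been chosen, `new String(bytes, charset)` decodes the file for parsing and `text.getBytes(charset)` encodes the edited text again on save, so the bytes on disk stay consistent with the declared meta tag.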