The Archive Base



Editorial Team
Asked: May 20, 2026 (2026-05-20T02:39:17+00:00)


I have a script that needs to determine the charset of an HTML file before it is read by lxml.etree.HTML() for parsing. I will search the HTML for the meta tag with the charset attribute, and assume ISO-8859-1 if it can’t be found (that’s the normally assumed charset for this, right?). However, I’m not sure of the best way to do that. I could try to build an etree with lxml, but I don’t want to read the whole file, since I may run into encoding problems. If I don’t read the whole file, though, I can’t build an etree, because some tags will not have been closed.

Should I just find the meta tag with some string searching and break out of the loop once it’s found or once a certain number of lines have been read? Or maybe use a low-level HTML parser, e.g. html.parser? I’m using Python 3, by the way. Thanks.


1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 2:39 am (2026-05-20T02:39:18+00:00)

    First try to extract the encoding from the HTTP headers. If it is not given there, parse the document with lxml and look for it in the markup. This can be tricky, since lxml can raise parse errors when the charset does not match the data. A workaround is to decode and re-encode the data, ignoring any bytes that cannot be decoded:

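    # html_data is assumed to hold the raw bytes of the page
    html_data = html_data.decode("utf-8", "ignore")  # drop bytes that are not valid UTF-8
    html_data = html_data.encode("utf-8", "ignore")  # back to bytes, now clean UTF-8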
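
For the first step mentioned above, reading the charset from the HTTP headers, here is a minimal sketch using only the standard library. The function name `charset_from_content_type` is hypothetical, and it assumes you already have the value of the `Content-Type` response header as a string.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

Letting the `email` package parse the header avoids hand-written string splitting and copes with quoted parameter values.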
    

    After the round trip above, you can parse the data by calling lxml.etree.HTML() with the parser encoding set to UTF-8.
    This lets you find the encoding declared in the document’s meta tags.
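
As an alternative sketch of the meta tag lookup, the version below swaps in the standard library's `html.parser` (the low-level parser the question mentions) instead of lxml. It reads only the first few kilobytes and decodes them as ISO-8859-1, which maps every byte to a character and so cannot fail. The names `MetaCharsetFinder` and `sniff_meta_charset`, and the 4096-byte limit, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="Content-Type"
            #                  content="text/html; charset=...">
            for part in (attrs.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def sniff_meta_charset(raw, limit=4096):
    """Look for a declared charset in the first `limit` bytes of `raw`."""
    head = raw[:limit].decode("iso-8859-1")  # never raises
    finder = MetaCharsetFinder()
    finder.feed(head)  # unclosed tags are fine for this parser
    return finder.charset


print(sniff_meta_charset(b'<html><head><meta charset="UTF-8"></head>'))
```

Because this parser is event-based, it does not need the tags to be closed, which sidesteps the problem of building a tree from a partial file.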

    After finding the encoding, re-parse the original HTML document (the untouched bytes, not the lossy UTF-8 copy) with that encoding.
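
Before re-parsing, it may be worth checking that the declared name is one Python actually knows, since pages sometimes declare misspelled or unsupported charsets. The sketch below is one way to do that with the standard library. The helper name `decode_with_declared` and the ISO-8859-1 default (taken from the question) are assumptions.

```python
import codecs


def decode_with_declared(raw, declared, default="iso-8859-1"):
    """Decode `raw` with the declared charset, or `default` if it is unusable.

    Returns a (codec_name, text) pair.
    """
    try:
        # codecs.lookup() raises LookupError for unknown names.
        name = codecs.lookup(declared).name if declared else default
    except LookupError:
        name = default
    # errors="replace" marks undecodable bytes instead of raising.
    return name, raw.decode(name, errors="replace")


name, text = decode_with_declared(b"caf\xc3\xa9", "UTF-8")
print(name, text)
```

The validated name can be handed to the parser together with the original bytes, or the decoded text can be parsed directly.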

    Unfortunately, sometimes the character encoding is not declared in the HTML either. Only after these steps fail would I suggest using the chardet module to guess the encoding.
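
A minimal sketch of that last-resort step follows. It assumes the third-party chardet package may or may not be installed, and falls back to ISO-8859-1 (the default from the question) when it is missing or has no guess. The function name `guess_encoding` is hypothetical.

```python
def guess_encoding(raw, default="iso-8859-1"):
    """Guess the encoding of `raw` bytes with chardet, else return `default`."""
    try:
        import chardet  # third-party; assumed optional here
    except ImportError:
        return default
    # chardet.detect() returns a dict; its "encoding" entry can be None
    # when no guess could be made.
    result = chardet.detect(raw)
    return result["encoding"] or default


print(guess_encoding(b"<html><body>plain text</body></html>"))
```

Statistical detection is only a guess, which is why it comes after the HTTP header and the meta tag rather than before them.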
