
The Archive Base


Editorial Team
Asked: May 25, 2026


I am currently attempting (or planning to attempt) to write as simple a program as possible to parse an HTML document into a tree.

After googling, I have found many answers saying “don’t do it, it’s been done” (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn’t use regular expressions. However, I haven’t found any guides on the “right” way to write a parser. (This, by the way, is something I’m attempting more as a learning exercise than anything, so I’d quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, simple: no fancy threading or efficiency required at this stage). However, in HTML not all tags are closed.
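A minimal sketch of the strict, XML-style approach described above, assuming every open tag has a matching close tag and that attributes can be thrown away. The names `Node`, `tokenize`, and `parse_strict` are hypothetical and chosen only for illustration; the scanner is deliberately naive (no comments, no self-closing syntax, no `>` inside attribute values).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for text nodes
    text: str = ""                # only used by text nodes
    children: list = field(default_factory=list)


def tokenize(source):
    """Yield ("open", name), ("close", name) or ("text", data) tuples."""
    pos = 0
    while pos < len(source):
        if source[pos] == "<":
            end = source.find(">", pos)
            if end == -1:
                raise ValueError("unterminated tag")
            body = source[pos + 1:end].strip()
            if not body or body == "/":
                raise ValueError("empty tag")
            if body.startswith("/"):
                yield ("close", body[1:].strip().lower())
            else:
                # Keep only the tag name; attributes are ignored.
                yield ("open", body.split()[0].lower())
            pos = end + 1
        else:
            end = source.find("<", pos)
            if end == -1:
                end = len(source)
            text = source[pos:end]
            if text.strip():          # skip whitespace-only runs
                yield ("text", text)
            pos = end


def parse_strict(source):
    """Build a tree, stepping down on open tags and up on close tags."""
    root = Node("#root")
    stack = [root]                    # stack[-1] is the element being filled
    for kind, value in tokenize(source):
        if kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)        # step down a level
        elif kind == "close":
            if len(stack) == 1 or stack[-1].tag != value:
                raise ValueError(f"unexpected </{value}>")
            stack.pop()               # step up a level
        else:
            stack[-1].children.append(Node("#text", value))
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1].tag}>")
    return root
```

Calling `parse_strict("<div><p>hi</p><p>there</p></div>")` returns a root with one `div` child holding two `p` nodes, while `parse_strict("<p>one<p>two")` raises `ValueError`, which is exactly the case where HTML and XML part ways.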

So my question is this: what would you recommend as a way of dealing with this? The only idea I’ve had is to treat it in a similar way to the XML, but keep a list of tags that aren’t necessarily closed, each with conditions for closure (e.g. <p> ends on </p> or on the next <p> tag).
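The idea of a table of closure conditions could look roughly like the sketch below. It assumes the input has already been split into `("open", name)`, `("close", name)` and `("text", data)` tuples. The tables `VOID_TAGS` and `CLOSED_BY_OPEN` are hypothetical and deliberately incomplete; the HTML specification defines the real rules, which are considerably longer.

```python
# Hypothetical, incomplete tables for illustration only.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# An open element (key) ends implicitly when one of these tags opens.
CLOSED_BY_OPEN = {
    "p": {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"},
    "li": {"li"},
    "td": {"td", "th", "tr"},
    "th": {"td", "th", "tr"},
    "tr": {"tr"},
}


def build_tree(tokens):
    """Build a nested dict tree, applying implied-close rules on open tags."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            # Close current elements for as long as the new tag implies their end.
            while len(stack) > 1 and value in CLOSED_BY_OPEN.get(stack[-1]["tag"], ()):
                stack.pop()
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if value not in VOID_TAGS:    # void elements never get children
                stack.append(node)
        elif kind == "close":
            # Find the nearest matching open element; stray closers are ignored.
            for depth in range(len(stack) - 1, 0, -1):
                if stack[depth]["tag"] == value:
                    del stack[depth:]     # also closes anything left open inside it
                    break
    return root
```

With this, `<p>a<p>b` produces two sibling `p` elements rather than a nested pair, and `</ul>` closes a trailing `<li>` that was never closed explicitly.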

Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 1:53 am

    So, I’ll try for an answer here:

    Basically, what makes “plain” HTML parsing (not talking about valid XHTML here) different from XML parsing is loads of rules, like <img> tags that are never closed, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser.
    You will need a validator along with the parser to build your tree. But you’ll have to decide on a standard for HTML you want to support, so that when you come across a weakness in the markup, you’ll know it’s an error and not just sloppy HTML.

    Know all the rules, build a validator, and then you’ll be able to build a parser. That’s Plan A.
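As a rough sketch of what the validation step in Plan A might look like, the function below checks a flat sequence of tag names against a small rule set and reports errors instead of fixing them. `VOID`, `OPTIONAL_END`, and `validate` are hypothetical names, and the two sets merely stand in for whichever HTML standard you decide to support.

```python
# Hypothetical rule set standing in for "the standard you chose to support".
VOID = {"br", "hr", "img"}      # elements that never take a closing tag
OPTIONAL_END = {"p", "li"}      # elements whose closing tag may be omitted


def validate(tags):
    """Return a list of error strings for a sequence like ["p", "img", "/p"]."""
    errors = []
    open_tags = []
    for tag in tags:
        if not tag.startswith("/"):
            if tag not in VOID:
                open_tags.append(tag)
            continue
        name = tag[1:]
        if name in VOID:
            errors.append(f"</{name}> is not allowed: <{name}> is a void element")
            continue
        if name not in open_tags:
            errors.append(f"</{name}> has no matching open tag")
            continue
        # Pop back to the matching tag; anything skipped over must be
        # allowed to omit its end tag, otherwise it is an error.
        while True:
            top = open_tags.pop()
            if top == name:
                break
            if top not in OPTIONAL_END:
                errors.append(f"<{top}> is not closed before </{name}>")
    for leftover in open_tags:
        if leftover not in OPTIONAL_END:
            errors.append(f"<{leftover}> is never closed")
    return errors
```

Under these rules `validate(["div", "p", "img", "/div"])` returns an empty list, because the open `<p>` is merely sloppy, while `validate(["div", "span", "/div"])` reports that `<span>` is not closed.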

    Plan B would be to allow for a certain error tolerance in your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. That eventually gets you a “good” layout tree, which will be an approximate solution for sloppy markup while being exact for correct markup.
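One possible reading of Plan B, sketched under some assumptions: pull the tag names out of the document into a flat list, then balance that list by dropping closers that were never opened and adding closers for tags left open. `TAG_RE`, `VOID`, `tag_list`, and `repair` are hypothetical names. The regular expression is used only to pick out tag names, not to parse the structure, and it will misread comments, scripts, and attribute values that contain `>`.

```python
import re

# Matches an opening or closing tag and captures the slash and the name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
VOID = {"br", "hr", "img"}      # hypothetical, incomplete list


def tag_list(html):
    """Return tags in document order, e.g. ["div", "p", "/div"], without attributes."""
    return [slash + name.lower() for slash, name in TAG_RE.findall(html)]


def repair(tags):
    """Return a balanced copy of tags: stray closers dropped, missing closers added."""
    fixed, open_tags = [], []
    for tag in tags:
        if not tag.startswith("/"):
            fixed.append(tag)
            if tag not in VOID:
                open_tags.append(tag)
        elif tag[1:] in open_tags:
            # Close everything left open inside the element being closed.
            while True:
                top = open_tags.pop()
                fixed.append("/" + top)
                if top == tag[1:]:
                    break
        # Otherwise the tag was never opened, so the closer is dropped.
    # Close whatever is still open at the end of the document.
    fixed.extend("/" + name for name in reversed(open_tags))
    return fixed
```

For well-formed input `repair` returns the list unchanged, which matches the point that the result is exact for correct markup and only approximate for sloppy markup. A balanced list like this can then be fed to a strict, XML-style tree builder.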

    Hope that helped!


© 2021 The Archive Base. All Rights Reserved