I am currently attempting (or planning to attempt) to write a simple (as possible)

Question


Asked: May 25, 2026

I am attempting (or at least planning) to write the simplest possible program to parse an HTML document into a tree.

After googling I have found many answers saying “don’t do it, it’s been done” (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn’t use regular expressions. However, I haven’t found any guides on the “right” way to write a parser. (This, by the way, is something I’m attempting more as a learning exercise than anything, so I’d quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, simple: no fancy threading or efficiency required at this stage). However, in HTML not all tags are closed.

So my question is this: what would you recommend as a way of dealing with this? The only idea I’ve had is to treat it in a similar way as the XML but have a list of tags that aren’t necessarily closed each with conditions for closure (e.g. <p> ends on </p> or next <p> tag).
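To make the idea concrete, here is a rough Python sketch of the XML-style stack approach extended with a table of closure conditions. The names `VOID_TAGS`, `CLOSED_BY`, `Node`, `tokenize` and `parse` are made up for illustration, and the rule tables are small, incomplete assumptions rather than the rules from any HTML standard. The tokenizer drops attributes and does not handle comments, scripts or `>` inside attribute values.

```python
from dataclasses import dataclass, field

# Assumption: tiny, incomplete rule tables for illustration only.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never have a close tag
# An open KEY element is implicitly closed when one of the VALUE tags opens.
CLOSED_BY = {
    "p": {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"},
    "li": {"li"},
    "td": {"td", "tr"},
    "tr": {"tr"},
}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node objects or text strings


def tokenize(html):
    """Yield ("open" | "close" | "text", value) pairs; attributes are dropped."""
    pos = 0
    while pos < len(html):
        lt = html.find("<", pos)
        if lt == -1:                      # no more tags: the rest is text
            yield ("text", html[pos:])
            return
        if lt > pos:
            yield ("text", html[pos:lt])
        gt = html.find(">", lt)
        if gt == -1:                      # unterminated tag: treat as text
            yield ("text", html[lt:])
            return
        body = html[lt + 1:gt].strip()
        if body.startswith("/"):
            yield ("close", body[1:].strip().lower())
        elif body and not body.startswith("!"):   # skip <!doctype ...> etc.
            yield ("open", body.rstrip("/").split()[0].lower())
        pos = gt + 1


def parse(html):
    root = Node("#root")
    stack = [root]                        # stack[-1] is the innermost open element
    for kind, value in tokenize(html):
        if kind == "text":
            if value.strip():
                stack[-1].children.append(value.strip())
        elif kind == "open":
            # Closure condition: the new tag may end the innermost open element.
            while len(stack) > 1 and value in CLOSED_BY.get(stack[-1].tag, ()):
                stack.pop()
            node = Node(value)
            stack[-1].children.append(node)
            if value not in VOID_TAGS:
                stack.append(node)
        else:
            # Close tag: step up to the matching open element, ignore strays.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].tag == value:
                    del stack[i:]
                    break
    return root


tree = parse("<p>one<p>two<br>three</p><ul><li>a<li>b</ul>")
print([child.tag for child in tree.children])
```

Note that this sketch only checks the innermost open element against the table, so an unclosed `<p>` hidden under an open `<b>` would not be closed by the next `<p>`; a fuller version would need to look further up the stack.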

Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?


1 Answer

Editorial Team · Answer 1 · 2026-05-25T01:53:58+00:00

So, I’ll try for an answer here.

Basically, what makes “plain” HTML parsing (not talking about valid XHTML here) different from XML parsing is a load of rules, like <img> tags that are never closed, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser.
You will need a validator along with the parser to build your tree. But you’ll have to decide on a version of the HTML standard you want to support, so that when you come across a weakness in the markup, you’ll know it’s an error and not just sloppy HTML.

Know all the rules, build a validator, and then you’ll be able to build a parser. That’s Plan A.

Plan B would be to build some error tolerance into your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. That eventually gets you a “good” layout tree, which will be an approximate solution for sloppy markup while being exact for correct markup.
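A minimal sketch of the Plan B bookkeeping in Python, assuming the tags have already been extracted into a flat list of names with attributes stripped (close tags written as `"/p"`). The function name `find_unbalanced` and the partial `VOID_TAGS` set are hypothetical, chosen just for this illustration. It walks the list once and reports which tags were left open and which close tags were never opened.

```python
# Assumption: partial list of tags that never take a close tag.
VOID_TAGS = frozenset({"br", "img", "hr", "input", "meta", "link"})


def find_unbalanced(tags):
    """Return (left_open, never_opened) as lists of (position, name) pairs.

    tags: tag names in document order, e.g. ["p", "b", "/b", "/p"].
    """
    open_stack = []     # (position, name) of tags that are currently open
    left_open = []      # open tags that never got their own close tag
    never_opened = []   # close tags with no matching open tag
    for position, tag in enumerate(tags):
        if not tag.startswith("/"):
            if tag not in VOID_TAGS:
                open_stack.append((position, tag))
            continue
        name = tag[1:]
        if all(open_name != name for _, open_name in open_stack):
            never_opened.append((position, name))
            continue
        # Everything opened after the matching tag was left open.
        while open_stack[-1][1] != name:
            left_open.append(open_stack.pop())
        open_stack.pop()
    left_open.extend(open_stack)  # still open at the end of the document
    left_open.sort()
    return left_open, never_opened


tags = ["div", "p", "b", "/b", "img", "p", "/i", "/div"]
left_open, never_opened = find_unbalanced(tags)
print(left_open)      # [(1, 'p'), (5, 'p')]
print(never_opened)   # [(6, 'i')]
```

With those two lists in hand, the tree builder can decide per tag whether to insert the missing close tag, drop the stray one, or flag it, which is where the "approximate for sloppy, exact for correct" behaviour comes from.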

Hope that helps!
