Here's one for java and one for c# and here's…

Question

Asked: May 11, 2026

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr='5'
/>
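A rough Python sketch of why line-at-a-time matching misses a tag like this. It assumes the standard library's html.parser is an acceptable stand-in for whatever real parser you would use; the name TagCollector is made up for illustration.

```python
import re
from html.parser import HTMLParser

# The same tag, spread over three lines.
DOC = "<tag\nattr='5'\n/>"

# Line-oriented approach: look for a complete <...> tag on each line.
line_hits = [m for line in DOC.splitlines()
             for m in re.findall(r"<\w+[^<>]*>", line)]


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(DOC)
collector.close()

print(line_hits)       # no line contains a whole tag
print(collector.tags)  # the parser sees one tag with one attribute
```

The parser works on the character stream rather than on lines, so the line breaks inside the tag are just whitespace to it.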

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src='imgtag.gif' alt='<img>' />
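To make that one concrete, here is a small Python sketch comparing a typical "everything up to the next >" pattern with the standard library's html.parser. The pattern is only an example of the naive approach, and ImgCollector is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser

DOC = "<img src='imgtag.gif' alt='<img>' />"

# Naive pattern: a tag runs from "<img" to the first ">".
naive = re.findall(r"<img[^>]*>", DOC)


class ImgCollector(HTMLParser):
    """Collects the attributes of every img tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed(DOC)
collector.close()

print(naive)             # the match stops at the ">" inside the alt text
print(collector.images)  # the parser keeps the quoted value intact
```

The regex cuts the tag off in the middle of the alt attribute, while the parser knows it is inside a quoted value and keeps reading.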

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id='outer'><span id='inner'>foo</span></span>
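A minimal Python sketch of the nesting problem, assuming a non-greedy pattern as the typical regex attempt and a simple stack on top of html.parser as the alternative. NestingTracker is an illustrative name, not an existing class.

```python
import re
from html.parser import HTMLParser

DOC = "<span id='outer'><span id='inner'>foo</span></span>"

# Non-greedy match pairs the OUTER start tag with the INNER end tag.
naive = re.findall(r"<span[^>]*>(.*?)</span>", DOC)


class NestingTracker(HTMLParser):
    """Keeps a stack of open spans and credits text to each of them."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_id = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for span_id in self.stack:
            self.text_by_id[span_id] = self.text_by_id.get(span_id, "") + data


tracker = NestingTracker()
tracker.feed(DOC)
tracker.close()

print(naive)
print(tracker.text_by_id)
```

The stack is the piece a traditional regex lacks: it is what lets each end tag be matched to the right start tag at any depth.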

People often want to match against the content of a document (such as the famous ‘find all phone numbers on a given page’ problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class='phonenum'>(<span class='area code'>703</span>) <span class='prefix'>348</span>-<span class='linenum'>3020</span></span>
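One way around this, sketched in Python: let a parser strip the markup first, then run the pattern over the text that is left. The phone number pattern and the TextExtractor name are assumptions for the sake of the example.

```python
import re
from html.parser import HTMLParser

DOC = ("<span class='phonenum'>(<span class='area code'>703</span>) "
       "<span class='prefix'>348</span>-<span class='linenum'>3020</span></span>")

# A simple pattern for numbers written like (703) 348-3020.
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


class TextExtractor(HTMLParser):
    """Collects only the character data, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed(DOC)
extractor.close()
text = "".join(extractor.chunks)

raw_match = PHONE.search(DOC)  # pattern applied to the markup itself
found = PHONE.findall(text)    # pattern applied to the visible text

print(raw_match)
print(text, found)
```

Regexes are fine for the text-matching half of the job; the trouble only starts when they are also asked to understand the markup.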

Comments may contain poorly formatted or incomplete tags:

<a href='foo'>foo</a>
<!-- FIXME:
    <a href='
-->
<a href='bar'>bar</a>
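A short Python sketch of how the half-finished tag in the comment derails a simple href pattern, again assuming html.parser as the stand-in parser. The document is written on one line here for brevity, and LinkCollector is a made-up name.

```python
import re
from html.parser import HTMLParser

DOC = "<a href='foo'>foo</a> <!-- FIXME: <a href=' --> <a href='bar'>bar</a>"

# Naive pattern: grab whatever sits between href=' and the next quote.
naive = re.findall(r"href='([^']*)'", DOC)


class LinkCollector(HTMLParser):
    """Collects href values from real anchor tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


collector = LinkCollector()
collector.feed(DOC)
collector.close()

print(naive)            # the second "link" is comment debris, and bar is lost
print(collector.hrefs)  # comments are routed elsewhere, so both links survive
```

The unclosed quote inside the comment swallows the start of the next real link, so the regex reports garbage and never sees bar.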

What other gotchas are you aware of?

1 Answer

score 0 · Answer 1 · 2026-05-11T15:46:18+00:00

Here’s some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y 'a]>b'> ]>
<x>
    <a b='&y;>' />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
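For comparison, a real XML parser takes that document in stride. A minimal sketch using Python's xml.etree.ElementTree, assuming the document is laid out one construct per line (the line breaks are not significant):

```python
import xml.etree.ElementTree as ET

XML_DOC = """<!DOCTYPE x [ <!ENTITY y 'a]>b'> ]>
<x>
    <a b='&y;>' />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>"""

root = ET.fromstring(XML_DOC)
child = root[0]

# The entity &y; is expanded inside the attribute value.
attr_value = child.get("b")

# All character data under the root, CDATA section included.
all_text = "".join(root.itertext())

print(root.tag, child.tag)
print(repr(attr_value))
print(repr(all_text))
```

The only real element inside x is a. Everything else that looks like a tag is entity text, CDATA content, part of a processing instruction, or plain character data.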

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN'
'http://www.w3.org/TR/html4/loose.dtd' [
    <!ENTITY % e "href='hello'">
    <!ENTITY e '<a %e;>'>
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN'
  'http://www.w3.org/TR/html4/strict.dtd'>

<HTML/
  <HEAD/
    <TITLE/>/
    <P/>
