I was just reviewing a previous post of mine and noticed a number of people suggesting that I shouldn’t use regex to parse XML. In that case the XML was relatively simple, and regex didn’t pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I’m curious how this might pose a problem in other cases. Is this just a ‘don’t reinvent the wheel’ type of issue?
The real trouble is nested tags, which are very difficult to handle with regular expressions. It’s possible with balanced matching, but that’s only available as balancing groups in .NET or as recursive patterns in a few other flavors such as PCRE and Perl. And even with that power, an ill-placed comment could still throw off the regular expression.
For example, this is a tricky one to parse…
You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there’s no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
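To make the nesting and comment problems concrete, here is a minimal Python sketch. The sample document, the `<item>` tag name, and the two helper functions are all hypothetical, made up purely for illustration. It compares a naive non-greedy regex against the standard library's `xml.etree.ElementTree` parser on the same input.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one <item> nested inside another, plus a comment
# that happens to contain something that looks like an <item> element.
SAMPLE = (
    "<root>"
    "<item>outer<item>inner</item></item>"
    "<!-- <item>commented out</item> -->"
    "</root>"
)


def items_by_regex(text):
    """Naive approach: a non-greedy match stops at the first closing tag,
    so it cannot pair nested tags and has no idea what a comment is."""
    return re.findall(r"<item>(.*?)</item>", text, flags=re.DOTALL)


def items_by_parser(text):
    """Parser approach: walk the element tree in document order.
    The default ElementTree parser does not add comments to the tree."""
    root = ET.fromstring(text)
    return [element.text for element in root.iter("item")]


if __name__ == "__main__":
    print(items_by_regex(SAMPLE))   # ['outer<item>inner', 'commented out']
    print(items_by_parser(SAMPLE))  # ['outer', 'inner']
```

The regex result mixes the outer and inner elements together and picks up the commented-out text, while the parser returns the two real elements and skips the comment. You could patch the pattern for this one input, but each patch tends to expose another case.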