I need to parse a html document that has been generated by saving a

Question

Asked: May 23, 2026 (2026-05-23T09:00:08+00:00)

I need to parse an HTML document that was generated by saving a Word document as HTML.

I have been using the HTML Agility Pack quite successfully, but in this instance I figured using regex for this one part might be easier (opinions?).

Word generates the following code when it translates one of its footnotes into HTML:

<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>

This output is consistent for every footnote; only the href and name values and the [2] text change.

I need to extract the _ftn2 and [2] elements.

So far I have the following regex, which extracts the _ftn2 part into the name group:

<a href="#(?<name>_ftn\d).*>(<span class=MsoFootNoteReference>)

I’m having a bit of trouble parsing the second bit with all those span tags.

Is it going to be easier to use regex for this or should I continue to use the HAP for this part?

As an aside, does anyone know why Word generates nested identical span tags?

<span class=MsoFootnoteReference>

1 Answer

Editorial Team · Answer 1 · 2026-05-23T09:00:09+00:00

If the input follows exactly that format then you can get away with a pretty loose regex. You just need to ignore everything except the parts you want to extract and then employ non-greedy expressions to eat up all the garbage between them:

<a href="#(?<name>_ftn\d+).*?(?<number>\[\d+\]).*?<\/a>

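You can use a non-greedy .*? to eat up all the extra markup because nothing in there will match your next \[\d+\] pattern. Since Word wraps the tag across several lines, the pattern needs an option that lets . match newlines (RegexOptions.Singleline in .NET). You don’t really need the .*?<\/a> bit on the end; that’s mostly for symmetry and a bit of extra paranoia.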
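A minimal sketch of the pattern in use, written in Python rather than .NET purely for illustration. Python spells named groups as (?P<name>...), and re.DOTALL is assumed here so that . can cross the line breaks Word puts inside the tag. The sketch uses _ftn\d+ so that footnote numbers above 9 are captured whole. The extract_footnotes helper and the SAMPLE string are hypothetical names introduced for the example.

```python
import re

# Named groups use Python's (?P<name>...) spelling. re.DOTALL lets "." match
# the newlines that Word inserts inside the markup.
FOOTNOTE_RE = re.compile(
    r'<a href="#(?P<name>_ftn\d+).*?(?P<number>\[\d+\]).*?</a>',
    re.DOTALL,
)


def extract_footnotes(html):
    """Return (anchor name, reference text) pairs for each footnote link."""
    return [(m.group("name"), m.group("number")) for m in FOOTNOTE_RE.finditer(html)]


# Hypothetical sample in the shape shown in the question.
SAMPLE = '''<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>'''

print(extract_footnotes(SAMPLE))  # [('_ftn2', '[2]')]
```

The lazy .*? stops at the first bracketed number after each link opens, so consecutive footnotes in the same document come back as separate pairs.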

Something like this is probably one of the few cases where using regular expressions to rip apart HTML makes sense. You could do this sort of thing with an HTML parser, but then you’d be in a nightmare of twisty XPath expressions (all of which look alike), DOM manipulations, or SAX events. And you might even get eaten by a grue.
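For comparison, here is a rough sketch of the event-driven parser route. It uses Python's standard html.parser module as a stand-in, not the HTML Agility Pack, so it only shows the general shape of the approach. The FootnoteParser class and the SAMPLE string are hypothetical, and the href check assumes footnote links always look like #_ftn followed by digits.

```python
from html.parser import HTMLParser


class FootnoteParser(HTMLParser):
    """Collect (anchor name, reference text) pairs from footnote links."""

    def __init__(self):
        super().__init__()
        self.footnotes = []
        self._name = None   # anchor name of the link being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        # Accept "#_ftn2" but not the back-reference form "#_ftnref2".
        if tag == "a" and href.startswith("#_ftn") and href[5:].isdigit():
            self._name = href[1:]
            self._text = []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._name is not None:
            self.footnotes.append((self._name, "".join(self._text).strip()))
            self._name = None


# Hypothetical sample in the shape shown in the question.
SAMPLE = '''<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>'''

parser = FootnoteParser()
parser.feed(SAMPLE)
parser.close()
print(parser.footnotes)
```

The parser version does not care how many span tags Word nests, but it needs noticeably more state-tracking than the one-line pattern for the same two values.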
