I want to parse some HTML and create objects from the text in a table. The table has several columns, and I want to save the content of certain columns for every row.
I already have the HTML, and I understand I should use Pattern and Matcher to get those strings, but I don’t know how to write the required regular expression.
This is a row I will be parsing:
<tr><td><a href="delirium.htm">Delirium</a></td><td>65...</tr>
So I want to extract Delirium from that string. How do I write a regular expression that says
get me the string that is between the string htm"> and </a></td>
?
This is a common question on SO and the answer is always the same: regular expressions are a poor and limited tool for parsing HTML because HTML is not a regular language.
You should be using an HTML parser, for example HTML Parser.
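That said, if every row really is as uniform as the one in the question, the literal request can be sketched with Pattern and Matcher. Treat this as a sketch under that assumption only: the class name LinkTextSketch and the method firstLinkText are made up for illustration, and the pattern stops working as soon as the markup varies (extra attributes after href, whitespace between the tags, a link that does not end in htm).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextSketch {
    // Assumption: each row contains exactly the sequence  htm">TEXT</a></td>.
    // (.*?) is a reluctant group, so it stops at the first </a></td> it meets.
    private static final Pattern LINK_TEXT = Pattern.compile("htm\">(.*?)</a></td>");

    // Returns the text of the first link in the row, or null if nothing matches.
    static String firstLinkText(String row) {
        Matcher m = LINK_TEXT.matcher(row);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The row from the question, copied as-is (including the elided part).
        String row = "<tr><td><a href=\"delirium.htm\">Delirium</a></td><td>65...</tr>";
        System.out.println(firstLinkText(row));
    }
}
```

For anything beyond a quick one-off on markup you control, a real parser is still the safer choice.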
If you’re curious what I mean by “regular language”, have a look at JMD, Markdown and a Brief Overview of Parsing and Compilers. Basically, a regular expression is equivalent to a DFA (deterministic finite automaton, also called a deterministic finite state machine). Parsing HTML requires a PDA (pushdown automaton), which is a finite automaton with a stack. The stack is what lets it handle recursive (nested) elements, because it can remember how many elements are still open, and a DFA cannot.
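To make the role of the stack concrete, here is a minimal sketch of a nesting check. It is not a real HTML parser: the names NestingSketch and isWellNested are made up, it assumes every tag has a matching close (no void elements like br or img), and it assumes no attribute value contains a > character. A regular expression is used only to find individual tags, which is a regular task; the nesting itself is tracked with a stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestingSketch {
    // Tokenizer only: matches a single opening or closing tag, e.g. <td> or </td>.
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Returns true if every opening tag is closed in the right order.
    static boolean isWellNested(String markup) {
        Deque<String> open = new ArrayDeque<>(); // the "stack" of the PDA
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                open.push(name);                 // opening tag: remember it
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;                    // closing tag does not match
            }
        }
        return open.isEmpty();                   // nothing may be left open
    }

    public static void main(String[] args) {
        String nested = "<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>";
        System.out.println(isWellNested(nested));
        System.out.println(isWellNested("<td><a>x</td></a>"));
    }
}
```

The stack can grow as deep as the nesting goes. A DFA has only a fixed number of states, so it cannot keep track of an arbitrary number of open elements.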