To start: I know I shouldn’t be using regex to parse HTML, as it’s not very reliable, not 100% safe, etc. However, this is a learning exercise for regex as much as anything else.
My example uses the BBC website http://www.bbc.co.uk/sport/football/premier-league/table.
The project parses the tbody of the first table. I am trying to do a search so that only the elements matching a search value are returned. For example, given the search “manc”, I would want the tr tags for Manchester City and Manchester United (matched from the URL).
What I have so far is <tr\b[^>]*>(.*?)manc(.*?)</tr>. However, its first match runs from the first tr to the closing tr after Man City; it then returns the expected result for Man Utd. Could anyone point out where I’ve gone wrong with this regex?
Edit: Source (Trimmed)
<tbody id="trc-20-118996114-3">
<tr id="team-138824012" class="team first">
<td class="statistics"></td>
<td class='position'>
<span class='moving-up'>Moving up</span>
<span class='position-number'>1</span>
</td>
<td class="team-name">
<a href='http://www.bbc.co.uk/sport/football/teams/arsenal'>Arsenal</a>
</td>
<td class="played">0</td>
<td class="home-won">
<span>0</span>
</td>
<td class="home-drawn">0</td>
<td class="home-lost">0</td>
<td class="home-for">0</td>
<td class="home-against">0</td>
<td class="away-won">
<span>0</span>
</td>
<td class="away-drawn">0</td>
<td class="away-lost">0</td>
<td class="away-for">0</td>
<td class="away-against">0</td>
<td class="goal-difference">0</td>
<td class="points">0</td>
<td class="last-10-games">
<ol>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win last" title="Win">
<span>Win</span>
</li>
</ol>
</td>
<td class="status">
<a class="report" href="http://www.bbc.co.uk/sport/0/football/17973141">Report</a>
</td>
</tr>
<tr id="team-137316633" class="team">
<td class="statistics"></td>
<td class='position'>
<span class='moving-up'>Moving up</span>
<span class='position-number'>2</span>
</td>
<td class="team-name">
<a href='http://www.bbc.co.uk/sport/football/teams/aston-villa'>Aston Villa</a>
</td>
<td class="played">0</td>
<td class="home-won">
<span>0</span>
</td>
<td class="home-drawn">0</td>
<td class="home-lost">0</td>
<td class="home-for">0</td>
<td class="home-against">0</td>
<td class="away-won">
<span>0</span>
</td>
<td class="away-drawn">0</td>
<td class="away-lost">0</td>
<td class="away-for">0</td>
<td class="away-against">0</td>
<td class="goal-difference">0</td>
<td class="points">0</td>
<td class="last-10-games">
<ol>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="loss last" title="Loss">
<span>Loss</span>
</li>
</ol>
</td>
<td class="status">
<a class="report" href="http://www.bbc.co.uk/sport/0/football/17973120">Report</a>
</td>
</tr>
<tr id="team-137318151" class="team">
<td class="statistics"></td>
<td class='position'>
<span class='moving-down'>Moving down</span>
<span class='position-number'>7</span>
</td>
<td class="team-name">
<a href='http://www.bbc.co.uk/sport/football/teams/manchester-city'>Man City</a>
</td>
<td class="played">0</td>
<td class="home-won">
<span>0</span>
</td>
<td class="home-drawn">0</td>
<td class="home-lost">0</td>
<td class="home-for">0</td>
<td class="home-against">0</td>
<td class="away-won">
<span>0</span>
</td>
<td class="away-drawn">0</td>
<td class="away-lost">0</td>
<td class="away-for">0</td>
<td class="away-against">0</td>
<td class="goal-difference">0</td>
<td class="points">0</td>
<td class="last-10-games">
<ol>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="win last" title="Win">
<span>Win</span>
</li>
</ol>
</td>
<td class="status">
<a class="report" href="http://www.bbc.co.uk/sport/0/football/17973148">Report</a>
</td>
</tr>
<tr id="team-137318152" class="team">
<td class="statistics"></td>
<td class='position'>
<span class='moving-down'>Moving down</span>
<span class='position-number'>8</span>
</td>
<td class="team-name">
<a href='http://www.bbc.co.uk/sport/football/teams/manchester-united'>Man Utd</a>
</td>
<td class="played">0</td>
<td class="home-won">
<span>0</span>
</td>
<td class="home-drawn">0</td>
<td class="home-lost">0</td>
<td class="home-for">0</td>
<td class="home-against">0</td>
<td class="away-won">
<span>0</span>
</td>
<td class="away-drawn">0</td>
<td class="away-lost">0</td>
<td class="away-for">0</td>
<td class="away-against">0</td>
<td class="goal-difference">0</td>
<td class="points">0</td>
<td class="last-10-games">
<ol>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="draw" title="Draw">
<span>Draw</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="loss" title="Loss">
<span>Loss</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win" title="Win">
<span>Win</span>
</li>
<li class="win last" title="Win">
<span>Win</span>
</li>
</ol>
</td>
<td class="status">
<a class="report" href="http://www.bbc.co.uk/sport/0/football/17973162">Report</a>
</td>
</tr>
</tbody>
The problem is that your regular expression is too broad. Look at what you’re asking for: <tr\b[^>]*>(.*?)manc(.*?)</tr>
Let’s simplify it a little bit.
You’re saying: match a tr, followed by anything, then manc, then anything again, then a closing tr. So the regex starts at the first tr and keeps matching until it finds manc. In the meantime it has passed a bunch of other tr tags, but your regex doesn’t care. The lazy .*? only keeps the match as short as possible from wherever it started; it does not move the start forward to the nearest tr.
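To see the overmatch concretely, here is a minimal sketch in Python. Python's re module is an assumption (the question doesn't name a regex engine), and the three-row html string is a made-up, heavily trimmed stand-in for the BBC table rather than the real markup.

```python
import re

# Hypothetical, heavily trimmed stand-in for the table in the question.
html = (
    "<tr id='a'><td><a href='/teams/arsenal'>Arsenal</a></td></tr>"
    "<tr id='b'><td><a href='/teams/manchester-city'>Man City</a></td></tr>"
    "<tr id='c'><td><a href='/teams/manchester-united'>Man Utd</a></td></tr>"
)

# The pattern from the question. DOTALL lets "." cross line breaks,
# which the real multi-line source would need.
pattern = re.compile(r"<tr\b[^>]*>(.*?)manc(.*?)</tr>", re.DOTALL)
matches = [m.group(0) for m in pattern.finditer(html)]

# The first match starts at the Arsenal row and only ends after Man City;
# the second match is the Man Utd row on its own.
for match in matches:
    print(match)
    print("---")
```

The engine commits to the leftmost possible start (the Arsenal row) and the lazy quantifier simply walks forward from there until manc turns up.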
Try this:
Or, I guess in your example:
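One way to keep each match inside a single row, sketched in Python, is to forbid the "anything" parts from stepping over a closing tr. This uses a negative lookahead and is not necessarily the exact pattern originally posted here. The function name find_rows and the sample html are hypothetical, and the choice of Python's re module is an assumption.

```python
import re

def find_rows(html, term):
    """Return each <tr>...</tr> block whose contents include `term`."""
    # (?:(?!</tr>).)*? consumes one character at a time, but only while the
    # text ahead is not "</tr>", so a match can never leave its own row.
    inside_row = r"(?:(?!</tr>).)*?"
    pattern = re.compile(
        r"<tr\b[^>]*>" + inside_row + re.escape(term) + inside_row + r"</tr>",
        re.DOTALL,
    )
    return [m.group(0) for m in pattern.finditer(html)]

# Hypothetical, heavily trimmed stand-in for the table in the question.
html = (
    "<tr id='a'><td><a href='/teams/arsenal'>Arsenal</a></td></tr>"
    "<tr id='b'><td><a href='/teams/manchester-city'>Man City</a></td></tr>"
    "<tr id='c'><td><a href='/teams/manchester-united'>Man Utd</a></td></tr>"
)

rows = find_rows(html, "manc")
for row in rows:
    print(row)
```

With this pattern the attempt that starts at the Arsenal row fails when it reaches that row's closing tr without having seen manc, so the engine moves on and matches the Man City and Man Utd rows separately. It still assumes rows are not nested inside other rows.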