Context
The task is to screen-scrape web content with QuotaXML SDK 1.6 and then display the data on the dashboard and on the iPhone.
This QuotaXML tool offers regex for extracting table data only.
QuotaXML parses HTML tables using a three-step approach:
1. First it identifies the table, for example using “(?si)<table.*?>(.*?)</table>”
2. Second within this parsed table it identifies rows, like “(?si)<tr.*?>(.*?)</tr>”
3. Third within this row scope, individual cells are identified like “(?si)<td.*?>(.*?)</td>”
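The three steps above can be sketched in Python as nested regex passes. This only illustrates the approach and is not QuotaXML's actual implementation; the function name extract_cells, the sample HTML, and the <td>-based cell pattern are assumptions.

```python
import re

# One pattern per step: table, then rows inside the table, then cells inside the row.
TABLE = re.compile(r'(?si)<table.*?>(.*?)</table>')
ROW = re.compile(r'(?si)<tr.*?>(.*?)</tr>')
CELL = re.compile(r'(?si)<td.*?>(.*?)</td>')  # assumed cell pattern

def extract_cells(html):
    """Return one list of cell contents per table row (hypothetical helper)."""
    rows = []
    for table_body in TABLE.findall(html):
        for row_body in ROW.findall(table_body):
            rows.append(CELL.findall(row_body))
    return rows

html = '<table><tr><td>05-01-2011</td><td>0123456789</td></tr></table>'
cells = extract_cells(html)
```

Because each step only sees the text captured by the previous one, any row filtering has to happen in the second pattern.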
The problem
The source HTML contains some rows that are not relevant data, such as lines or images that span the full table width using a colspan.
Other tables contain rows that are not relevant to the data needed, such as call detail records for calls to freephone numbers (0800 and 00800 in this case), which are not subtracted from the minutes in your plan.
In other words, (.*?) must not match ‘ colspan="’, ‘>0800’, or ‘>00800’.
In code:
exclude:<tr><td colspan="2"></td></tr>
include:<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>
exclude:<tr><td>05-01-2011</td><td>08004913</td></tr>
include:<tr><td>05-01-2011</td><td>0123456789</td></tr>
Homework done
Even my first, deliberately simple attempts, which only try to exclude colspan, all fail:
(?si)<tr.*?>(?!colspan)(.*?)</tr>
(?si)<tr.*?>(.*?)(?!colspan)</tr>
(?si)<tr.*?>.*?[^colspan].*?</tr>
(?si)<tr(\s[^>]*)?>.*?(?!colspan).*?</tr>
(?si)<tr(\s[^>]*)?>.*?(!colspan).*?</tr>
(?si)<tr(\s[^>]*)?>(.*?)(?!colspan)</tr>
(?si)<tr.*?>^(?!.*?colspan=").*?</tr>
The question “How to negate specific word in regex?” seems related, though its suggestions don’t result in a match at all:
(?si)<tr.*?>(.(?<!colspan))*?</tr>
(?si)<tr.*?>(?!.*colspan).*</tr>
The explanation of positive and negative lookarounds at http://www.regular-expressions.info/lookaround.html did not give me the clue either.
How should I correctly write this regex?
The first problem you’re having is that your original expressions are very fragile. The “.*?>” is intended to match everything up to the earliest “>”, but it will actually extend to a later “>” whenever the rest of the expression fails and the engine backtracks.
Use a construct like “[^>]*>” instead.
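A small Python check of this backtracking effect, using the colspan row from the question. The lookahead (?!<td colspan) is only there to force a failure right after the tag; the variable names are illustrative.

```python
import re

row = '<tr><td colspan="2"></td></tr>'

# Fragile: when the lookahead fails after "<tr>", ".*?" is free to grow
# past the first ">" and stop at the ">" that closes the <td ...> tag.
fragile = re.compile(r'(?si)<tr.*?>(?!<td colspan)(.*?)</tr>')

# Robust: "[^>]*" can never consume a ">", so the tag match stays inside <tr ...>.
robust = re.compile(r'(?si)<tr[^>]*>(?!<td colspan)(.*?)</tr>')

fragile_match = fragile.search(row)  # matches; group(1) is '</td>'
robust_match = robust.search(row)    # None: the row is rejected as intended
```

In the fragile version the "tag" part of the pattern has silently swallowed the whole <td colspan="2"> cell, which is why the row still matches.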
The second problem is that you’re misinterpreting the meaning of the negative lookahead: it does not check that the given pattern occurs nowhere ahead of its position. It only checks that the pattern does not match AT THAT POSITION.
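A short Python illustration of the difference, on a made-up string. To forbid a word anywhere ahead, the lookahead itself has to scan forward, for example with a leading “.*”.

```python
import re

text = "abc colspan def"

# (?!colspan) only asserts that "colspan" does not start at the current position.
# At position 0 the text starts with "abc", so the assertion holds.
at_position = re.match(r'(?!colspan).*', text)

# Directly in front of the word, the same assertion fails.
before_word = re.match(r'abc (?!colspan)', text)

# (?!.*colspan) scans ahead, so "colspan" anywhere later blocks the match.
anywhere = re.match(r'(?!.*colspan).*', text)
```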
With these two changes, your first attempt was very close to solving your test cases.
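A minimal sketch of what the two changes might look like when applied to the first attempt, checked in Python against the four sample rows from the question, one row at a time. The exact pattern, the “0?0800” alternative for 0800 and 00800, and the names ROW and samples are assumptions for illustration.

```python
import re

# "[^>]*>" instead of ".*?>", and lookaheads that scan ahead with ".*".
ROW = re.compile(r'(?si)<tr[^>]*>(?!.*colspan)(?!.*>0?0800)(.*?)</tr>')

samples = [
    '<tr><td colspan="2"></td></tr>',                                          # exclude
    '<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>',  # include
    '<tr><td>05-01-2011</td><td>08004913</td></tr>',                          # exclude
    '<tr><td>05-01-2011</td><td>0123456789</td></tr>',                        # include
]

# Each row is tested on its own, as in the question's test cases.
kept = [row for row in samples if ROW.search(row)]
```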
Note that this still fails to solve the whole problem, because a “colspan” or 0800 number anywhere later in the string will block the match. You need further test cases, such as a relevant row followed by an excluded row in the same string.
So you need to ensure that the negative lookahead never crosses the closing “</tr>” into the next row.
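One way to keep the lookahead inside the current row is to let it advance only over characters that do not begin a “</tr>”. The sketch below compares that with the unbounded lookahead on a multi-row string; both patterns and all names are assumptions for illustration.

```python
import re

table = "\n".join([
    '<tr><td colspan="2"></td></tr>',
    '<tr><td><strong>Date</strong></td><td><strong>Time</strong></td></tr>',
    '<tr><td>05-01-2011</td><td>08004913</td></tr>',
    '<tr><td>05-01-2011</td><td>0123456789</td></tr>',
])

# Unbounded lookahead: ".*" runs to the end of the string, so the 0800 row
# further down also blocks the header row above it.
unbounded = re.compile(r'(?si)<tr[^>]*>(?!.*colspan)(?!.*>0?0800)(.*?)</tr>')

# Bounded lookahead: "(?:(?!</tr>).)*" stops at the current row's </tr>.
IN_ROW = r'(?:(?!</tr>).)*'
bounded = re.compile(
    r'(?si)<tr[^>]*>(?!' + IN_ROW + r'colspan)(?!' + IN_ROW + r'>0?0800)(.*?)</tr>'
)

unbounded_rows = unbounded.findall(table)  # only the last row survives
bounded_rows = bounded.findall(table)      # header row and last row survive
```

Whether QuotaXML's regex engine accepts nested lookaheads like this would need to be checked against the tool itself.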
At which point one may wonder whether RegExps are the right tool for this particular problem 🙂