I am using PHP/Regex to parse some data for an application. The pages I am parsing have table formats that include a header followed by a bunch of items. What I am trying to do is get the header for each table, along with all of the items so that I can label each item as part of that group (defined by the header).
I currently have it set up with an expression matching each header, and then everything up to the next header. I then use a loop on the header count to match the additional data from the second match in the first expression.
So basically:
preg_match_all ('#table-header.*?>(.*?)<\/td>(.*?)table-header#s', $url, $gr, PREG_PATTERN_ORDER);
for($i = 0; $i < count($gr[0]); $i++) {
    preg_match_all ('#type_id.*?<b>(.*?)</b> ... #s', $gr[2][$i], $info, PREG_PATTERN_ORDER);
    $group = trim($gr[1][$i]);
    for($ii = 0; $ii < count($info[0]); $ii++) {
        $name = trim($info[1][$ii]);
        ...
    }
}
My issue is that it skips every other group. I can only assume this is because it matches from one table-header to the next, and then resumes searching after that closing table-header instead of starting the next match at it. How can I get the next match to begin at the table-header that ended the previous match? Unfortunately, the pages do not have enough unique items near the beginning/end points to match on something different. The HTML looks similar to this:
<td align='center' class='table-header' colspan='18' valign='top'>
Header
</td>
...items...
<td align='center' class='table-header' colspan='18' valign='top'>
Header 2
</td>
I tried using the colspan as the start of my expression, and grabbing everything up to the next table-header, but it just breaks.
Thanks for any suggestions.
You should have a look at this class instead:
http://simplehtmldom.sourceforge.net/
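If you would rather stay with regex, the skipping happens because preg_match_all matches cannot overlap: the closing table-header is consumed by one match, so the next match can only begin at the header after it. A lookahead checks for the next header without consuming it. The following is only a sketch under assumptions: the sample markup is modelled on your snippet, the item rows and the item pattern are made up (your real type_id pattern is elided in the question), and $html and $groups are names chosen for illustration.

```php
<?php
// Sample markup modelled on the snippet in the question; the item rows are invented.
$html = <<<HTML
<td align='center' class='table-header' colspan='18' valign='top'>
Header
</td>
<tr><td>item A</td></tr>
<tr><td>item B</td></tr>
<td align='center' class='table-header' colspan='18' valign='top'>
Header 2
</td>
<tr><td>item C</td></tr>
HTML;

// (?=table-header|\z) asserts that the next header (or the end of the input)
// follows, but does not consume it, so the next match can start at that header.
$pattern = '#table-header[^>]*>(.*?)</td>(.*?)(?=table-header|\z)#s';

preg_match_all($pattern, $html, $gr, PREG_SET_ORDER);

$groups = [];
foreach ($gr as $m) {
    $group = trim($m[1]);
    // Hypothetical item pattern, standing in for the elided type_id expression.
    preg_match_all('#<tr><td>(.*?)</td></tr>#s', $m[2], $info);
    $groups[$group] = array_map('trim', $info[1]);
}

print_r($groups);
```

With this sample input, both headers are matched and each one is paired with its own items. Note that each captured body still ends with the opening fragment of the next header cell (up to class='), which is harmless as long as the item pattern does not match it. The \z alternative is what lets the last group, which has no following header, be captured as well.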