I have a string and need a regex pattern that extracts only the date and the numbers from inside the tags:
Dim a as string= "<table id=table-1 > <tbody> <td align=right> <h2 id=date-one>12.09.2010</h2> </td> </tr> </tbody></table> <table id=table-2 border=0 cellspacing=0 cellpadding=0><tbody><tr><td align=center valign=middle><h3 id=nb-a>01</h3></td><td align=center valign=middle><h3 id=nb-a>>02</h3></td><td align=center valign=middle><h3 id=nb-a>03</h3></td></tr></tbody></table>"
The string will contain more than one block of similar data, so I need to do this in a loop.
Thank you!
Adrian
An HTML parser (e.g., the HtmlAgilityPack) will be simpler in the long term, but as a guide to Regex, here’s how to do it for your case:
Naively, a first attempt would match any run of digits:
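The pattern for this step is not shown above. A plausible reconstruction, assuming the naive pattern is simply `\d+`, might look like the sketch below. The `html` string is a shortened stand-in for the one in the question, and the module name is only illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NaiveMatch
    Sub Main()
        ' Shortened stand-in for the HTML string in the question.
        Dim html As String = "<table id=table-1><h2 id=date-one>12.09.2010</h2></table><table id=table-2 border=0><h3 id=nb-a>01</h3><h3 id=nb-a>02</h3></table>"

        ' Assumed naive pattern: any run of one or more digits.
        For Each m As Match In Regex.Matches(html, "\d+")
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

On this sample the loop also prints digits from attributes such as `table-1` and `border=0`, and the date comes back as three separate matches (`12`, `09`, `2010`).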
This of course matches far too many items and does not match the date as a single group. In your case each value sits inside a tag, so it is preceded by a ‘>’ and followed by a ‘<’.
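As a sketch of that idea, assuming a pattern like `>[\d.]+<` (the original pattern is not shown, so this is a guess at its shape), the match can be anchored on the surrounding angle brackets. The `html` string is again a shortened stand-in for the question's data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BracketMatch
    Sub Main()
        ' Shortened stand-in for the HTML string in the question.
        Dim html As String = "<table id=table-1><h2 id=date-one>12.09.2010</h2></table><table id=table-2 border=0><h3 id=nb-a>01</h3><h3 id=nb-a>02</h3></table>"

        ' Assumed pattern: digits and dots that sit directly between '>' and '<'.
        For Each m As Match In Regex.Matches(html, ">[\d.]+<")
            Console.WriteLine(m.Value)
        Next
        ' Prints:
        ' >12.09.2010<
        ' >01<
        ' >02<
    End Sub
End Module
```

The attribute digits are gone and the date is one match, but each value still carries the two bracket characters.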
This unfortunately includes the ‘>’ and the ‘<’ in your matches. Now we need positive lookbehind and positive lookahead:
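A minimal sketch of the lookaround version, assuming the pattern `(?<=>)[\d.]+(?=<)`: the lookbehind `(?<=>)` requires a ‘>’ just before the match and the lookahead `(?=<)` requires a ‘<’ just after it, without consuming either character. The sample string is a shortened stand-in for the question's data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundMatch
    Sub Main()
        ' Shortened stand-in for the HTML string in the question.
        Dim html As String = "<table id=table-1><h2 id=date-one>12.09.2010</h2></table><table id=table-2 border=0><h3 id=nb-a>01</h3><h3 id=nb-a>02</h3></table>"

        ' Assumed pattern: lookbehind for '>', lookahead for '<'.
        ' Neither bracket becomes part of the match value.
        For Each m As Match In Regex.Matches(html, "(?<=>)[\d.]+(?=<)")
            Console.WriteLine(m.Value)
        Next
        ' Prints:
        ' 12.09.2010
        ' 01
        ' 02
    End Sub
End Module
```

Because `Regex.Matches` returns every match in the string, the same loop covers the case where the input holds several blocks of similar data.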
Now things are looking good because we’re only matching the date and the three numbers! However, what if the date were separated by ‘-’ or ‘/’ instead of ‘.’?
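One way to allow the extra separators, assuming the lookaround pattern from the previous step, is to widen the character class to `[\d./-]`. The hyphen goes last in the class so it is read as a literal character rather than a range. The sample string here is hypothetical test data, not the question's.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SeparatorMatch
    Sub Main()
        ' Hypothetical sample with three date separators and one plain number.
        Dim html As String = "<h2>12.09.2010</h2><h2>12-09-2010</h2><h2>12/09/2010</h2><h3>01</h3>"

        ' Assumed pattern: digits plus '.', '/' or '-' between '>' and '<'.
        For Each m As Match In Regex.Matches(html, "(?<=>)[\d./-]+(?=<)")
            Console.WriteLine(m.Value)
        Next
        ' Prints:
        ' 12.09.2010
        ' 12-09-2010
        ' 12/09/2010
        ' 01
    End Sub
End Module
```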
Easily handled. But what if there are spaces before or after the number or date within the element text?
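A sketch of the whitespace-tolerant version, assuming the pattern `(?<=>\s*)[\d./-]+(?=\s*<)`. It relies on the .NET regex engine accepting variable-length lookbehind; many other engines reject `\s*` inside a lookbehind. The sample string is hypothetical test data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WhitespaceMatch
    Sub Main()
        ' Hypothetical sample with spaces around the element text.
        Dim html As String = "<h2> 12.09.2010 </h2><h3>01 </h3><h3>  02</h3>"

        ' Assumed pattern: optional whitespace is absorbed by the lookarounds,
        ' so the match value itself contains no spaces.
        For Each m As Match In Regex.Matches(html, "(?<=>\s*)[\d./-]+(?=\s*<)")
            Console.WriteLine("[" & m.Value & "]")
        Next
        ' Prints:
        ' [12.09.2010]
        ' [01]
        ' [02]
    End Sub
End Module
```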
Not too bad. The only problem is that this method still takes more effort and breaks more easily than using an HTML parser to loop through all the elements, check whether each element’s text is a valid number or date, and add the matching element’s text to a list.
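For comparison, a rough sketch of the parser approach. It assumes the HtmlAgilityPack NuGet package is referenced, and that the dates use the `dd.MM.yyyy` layout (a guess, since the sample date `12.09.2010` is ambiguous between day-first and month-first). The module and variable names are illustrative.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports HtmlAgilityPack

Module ParserSketch
    Sub Main()
        ' Shortened stand-in for the HTML string in the question.
        Dim html As String = "<table id=table-1><h2 id=date-one>12.09.2010</h2></table><table id=table-2 border=0><h3 id=nb-a>01</h3><h3 id=nb-a>02</h3></table>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim values As New List(Of String)()
        For Each node As HtmlNode In doc.DocumentNode.Descendants()
            ' Only text nodes carry the element text we care about.
            If node.NodeType <> HtmlNodeType.Text Then Continue For

            Dim text As String = node.InnerText.Trim()
            Dim parsedDate As DateTime
            Dim parsedNumber As Integer

            ' Assumption: dates are dd.MM.yyyy; adjust the format as needed.
            Dim isDate As Boolean = DateTime.TryParseExact(text, "dd.MM.yyyy", CultureInfo.InvariantCulture, DateTimeStyles.None, parsedDate)
            Dim isNumber As Boolean = Integer.TryParse(text, parsedNumber)

            If isDate OrElse isNumber Then values.Add(text)
        Next

        For Each value As String In values
            Console.WriteLine(value)
        Next
    End Sub
End Module
```

The validation lives in `TryParseExact` and `TryParse` instead of in the pattern, so tightening what counts as a date or number does not require touching a regex.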
Consider, for example, altering the Regex method to handle currencies (where “$100.03.45” should not match), or commas in numbers, or ensuring that dates have exactly three groups, each with one, two, or four digits, where only one group can have four and one of the two-digit groups cannot exceed 12, etc. Insanity lies down that road.