I need to extract this text:
Line 1 text.
Line 2 text. Line 2 some more text.
Line 3 text,
Line 4 text
from this HTML:
...
<tr><td class="td_my_custom_text">Line 1 text.
<br>Line 2 text. Line 2 some more text.
<br>Line 3 text,
<br>Line 4 text
<br></td></tr><tr><td> </td></tr>
...
Using this RegEx: <td\ class="td_my_custom_text">[\s\S]*?</td> I have managed to get something close but not close enough. <td class="td_my_custom_text">, <br> and </td> are still inside and I am stuck.
- What needs to be changed in my regular expression to get rid of them?
- Is there some Windows tool to automate this job and copy just extracted data to new file(s)? I have 5000+ files like this one and I am thinking about making a small program using regex or html parser but I would like to know if there is a better approach first.
It looks like you’re better off just stripping the tags, because that’s essentially what you’re doing.
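As a rough sketch of that idea in Python (my choice of language, the question doesn't name one): add a capture group to your regex so the match excludes the surrounding `<td ...>` and `</td>`, then strip whatever tags remain inside it (here only `<br>`). The names `CELL_RE`, `TAG_RE`, `SAMPLE` and `extract_cell_text` are made up for illustration, and this assumes the cell never contains a nested `<td>` or a literal `<` in its text.

```python
import re

# Sample input copied from the question.
SAMPLE = """...
<tr><td class="td_my_custom_text">Line 1 text.
<br>Line 2 text. Line 2 some more text.
<br>Line 3 text,
<br>Line 4 text
<br></td></tr><tr><td> </td></tr>
..."""

# Same pattern as in the question, but with a capture group around the
# cell contents so the <td ...> and </td> tags are left out of group(1).
CELL_RE = re.compile(r'<td\s+class="td_my_custom_text">([\s\S]*?)</td>')

# Matches any remaining tag inside the cell, such as <br>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_cell_text(html):
    """Return the cell text with tags removed, or None if no cell matches."""
    match = CELL_RE.search(html)
    if match is None:
        return None
    inner = TAG_RE.sub("", match.group(1))
    # Trim each line and drop the empty ones left behind by removed tags.
    lines = (line.strip() for line in inner.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(extract_cell_text(SAMPLE))
```

The same two-step idea (capture group, then remove `<[^>]+>`) should carry over to any regex tool that supports groups and replace.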
You should also look at dasbinkenlight’s link in his comment to understand more about HTML parsing.
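For the 5000+ files, a small script with a real HTML parser is probably less fragile than a regex. Below is a minimal sketch using only Python's standard library `html.parser`. The class and function names, the `*.html` file pattern, the UTF-8 encoding and the `.txt` output naming are all assumptions on my part, so adjust them to your data. It also assumes the target cell contains no nested table.

```python
from html.parser import HTMLParser
from pathlib import Path


class CellTextParser(HTMLParser):
    """Collects the text found inside <td class="td_my_custom_text"> cells."""

    def __init__(self, css_class="td_my_custom_text"):
        super().__init__()
        self.css_class = css_class
        self.in_cell = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self.in_cell = True

    def handle_endtag(self, tag):
        # Assumption: no nested tables, so any </td> closes the cell.
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.chunks.append(data)


def extract_text(html):
    """Return the text of the matching cell(s), one non-empty line per row."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


def convert_folder(src_dir, dst_dir):
    """Write one .txt file into dst_dir for every *.html file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.html"):
        html = path.read_text(encoding="utf-8", errors="replace")
        (dst / (path.stem + ".txt")).write_text(extract_text(html), encoding="utf-8")
```

Calling `convert_folder("input", "output")` (both folder names are placeholders) would then process every file in one go.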