I’m trying to extract some information from a website. The information I want is in a table, so I wrote a regex, but I don’t know the right way to simplify it.
The following are two parts of my regex that I would like to simplify:
<br>(.*)<br>(.*)<br>(.*)
<tr><td>(.+)r>(.+)r>(.+)r>(.+).+</td></tr> # This part should be repeated n times (n = 1 to 10)
I looked through the Python documentation and I can’t figure out how to do it. Perhaps you can give me a hint.
Thank you,
mF.
This is the wrong way to go unless you’re trying to scrape some data out of a tiny fragment.
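If the input really is a tiny, predictable fragment, the repetition can be handled without typing the group out several times. The following is only a sketch under assumptions: the sample markup, the four fields per row, and the names `field`, `row_pattern` and `rows` are made up for illustration, since the real page is not shown. It builds the repeated `<br>(...)` part by string multiplication and lets `re.findall` take care of the "n rows" part.

```python
import re

# Hypothetical sample fragment; the real page's markup is not shown in the thread.
html = (
    "<table>"
    "<tr><td>a1<br>a2<br>a3<br>a4</td></tr>"
    "<tr><td>b1<br>b2<br>b3<br>b4</td></tr>"
    "</table>"
)

# One captured field: everything up to the next tag.
# [^<]* stops at the next "<", unlike the greedy (.*) which can run past it.
field = r"([^<]*)"

# First field, then "<br>field" repeated three times (four fields per row assumed).
row_pattern = re.compile(r"<tr><td>" + field + (r"<br>" + field) * 3 + r"</td></tr>")

# findall returns one tuple of groups per matching row, so the number of rows
# does not have to be encoded in the pattern itself.
rows = row_pattern.findall(html)
for row in rows:
    print(row)
```

This still breaks as soon as the markup gains attributes, whitespace or nested tags, which is the reason for the advice that follows.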
It would be much better if you used a tolerant HTML parser. BeautifulSoup, mentioned earlier, is a good one, but it’s stagnating and I don’t believe it’s being actively maintained anymore.
A highly recommended parser for Python is lxml.
There was a long thread discussing parsing XHTML on one of our local mailing lists here which you might find useful too.
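To give an idea of what the parser-based approach looks like, here is a minimal sketch. It deliberately uses the standard library's `html.parser` as a stand-in so that it runs without installing anything; it is not lxml or BeautifulSoup, though the idea carries over to them. The class name `TableCells`, the sample markup and the variable names are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text pieces inside every <td>, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of strings per table row
        self._in_td = False  # True while we are inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True
        # <br> and other tags are ignored; they only separate text pieces.

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_td and self.rows and text:
            self.rows[-1].append(text)


# Hypothetical sample; note the rows do not need the same number of fields.
sample = (
    "<table>"
    "<tr><td>a1<br>a2<br>a3<br>a4</td></tr>"
    "<tr><td>b1<br>b2<br>b3</td></tr>"
    "</table>"
)

parser = TableCells()
parser.feed(sample)
parser.close()
parsed_rows = parser.rows
for row in parsed_rows:
    print(row)
```

Unlike the regex, this does not care how many rows there are or how many `<br>`-separated fields each cell holds, and it keeps working if the tags gain attributes.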