Sometimes I need to parse HTML to extract URLs.
I found that both [html.parser.HTMLParser] and [re.match] can do the job.
I want to know which is faster.
Is there a Python module like jQuery for parsing HTML?
If you have a better solution, please leave a comment.
Thanks
lxml is very good.
It makes the job really simple.
>>> from urllib.request import urlopen
>>> from lxml.html import parse
>>> for url in parse(urlopen('http://www.stackoverflow.com')).getroot().find_class('question-hyperlink'): print(url.get('href'))
I would strongly suggest lxml. In my experience, it is the fastest option. lxml actually builds a tree in memory, so you can parse/serialize/…
On the other hand, if you have to pick between the two options you mentioned, I’d suggest timing both with the timeit module and seeing which is faster.
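For what it's worth, here is a rough sketch of how such a timeit comparison could look, using only the standard library. The sample HTML, the `LinkCollector` class, and the function names are all made up for illustration, and the regex is deliberately naive (it only handles double-quoted `href` values). Note that `re.match` only matches at the start of a string, so this sketch uses a compiled pattern with `findall` instead.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical sample document; replace it with the HTML you actually parse.
SAMPLE_HTML = (
    '<html><body>'
    + ''.join(
        f'<p><a class="question-hyperlink" href="/questions/{i}">Question {i}</a></p>'
        for i in range(200)
    )
    + '</body></html>'
)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


def urls_with_htmlparser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


# Naive pattern (an assumption for this sketch): double-quoted hrefs in <a> tags only.
HREF_RE = re.compile(r'<a\b[^>]*\bhref="([^"]*)"', re.IGNORECASE)


def urls_with_regex(html):
    return HREF_RE.findall(html)


def benchmark(repeat=5, number=20):
    """Return the best total time in seconds for each extractor."""
    results = {}
    for func in (urls_with_htmlparser, urls_with_regex):
        timings = timeit.repeat(lambda: func(SAMPLE_HTML), repeat=repeat, number=number)
        results[func.__name__] = min(timings)
    return results


if __name__ == '__main__':
    for name, seconds in benchmark().items():
        print(f'{name}: {seconds:.4f} s')
```

The numbers depend heavily on your input and machine, so run it against HTML that resembles what you really parse. Keep in mind that a regex that looks faster may still miss single-quoted or unquoted attributes that a real parser handles.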