I am trying to parse a website for
blahblahblah
<a href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah
(there are many of these, and I want all of them in some tokenized form). The problem is that the tags I want are written with two spaces between "a" and "href" ("a  href"); there are others written with a single space ("a href") that I do NOT want to retrieve, so using tree.xpath('//a/@href') doesn't quite work, because it returns both kinds. Does anyone have any suggestions on what to do?
Thanks!
This code works as expected:
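Edit: as far as I know, it's not possible to do what you want with lxml, because the parsed tree does not keep the whitespace between the tag name and its attributes. You can use a regex on the raw page source instead.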
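A minimal sketch of the regex approach, assuming the hrefs are always double-quoted and the tag is written in lowercase exactly as in the question. The sample string, the example.com URLs, and the names TWO_SPACE_HREF and two_space_hrefs are made up for illustration; in practice you would run this on the raw page source rather than on a parsed tree.

```python
import re

# Hypothetical sample input standing in for the raw page source.
html = '''blahblahblah
<a  href="http://example.com/wanted" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah
<a href="http://example.com/unwanted" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
'''

# "<a", then exactly two spaces, then href="..."; the group captures the value.
# Assumes double-quoted attribute values and a lowercase tag name.
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')

def two_space_hrefs(source):
    """Return the href values of anchors written with two spaces after '<a'."""
    return TWO_SPACE_HREF.findall(source)

print(two_space_hrefs(html))
```

Anchors with one space (or three) are skipped because the pattern requires "href" to follow the second space directly. If the page mixes quote styles or letter case, the pattern would need to be loosened accordingly.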