I am trying to parse a website for blahblahblah <a href=THIS IS WHAT I

Question


Asked: June 18, 2026 (2026-06-18T19:57:31+00:00)


I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(There are many of these, and I want all of them in some tokenized form.) The problem is that "a  href" actually has two spaces, not just one. There are some written as "a href" with one space that I do NOT want to retrieve, so using tree.xpath('//a/@href') doesn't quite work. Does anyone have any suggestions on what to do?

Thanks!


1 Answer

Editorial Team · Answer 1 · 2026-06-18T19:57:33+00:00


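This code works as expected:

from lxml import etree

path = "file:///path/to/file.html"  # can be an http URL too
# Use the HTML parser so that markup that is not well-formed XML still parses
doc = etree.parse(path, etree.HTMLParser())

# Print the first href value found in the document
print(doc.xpath('//a/@href')[0])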

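Edit: as far as I know, lxml cannot do what you want. Once the document is parsed, the whitespace between the tag name and its attributes is discarded, so XPath cannot tell "a  href" (two spaces) from "a href" (one space).

You can run a regex over the raw HTML instead.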
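
A minimal sketch of the regex approach, assuming the raw HTML is already in a string and that attribute values are double-quoted as in the question's sample. The names TWO_SPACE_HREF and two_space_hrefs are made up for illustration, and the one-space anchor in the sample is invented to show that it is skipped.

```python
import re

# "<a" followed by exactly two spaces, then href="...".
# Assumption: attribute values are double-quoted, as in the question's sample.
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')


def two_space_hrefs(html):
    """Return the href values of anchors written as '<a  href=' (two spaces)."""
    return TWO_SPACE_HREF.findall(html)


sample = (
    'blahblahblah\n'
    '<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>\n'
    '<a href="one space, skipped">x</a>\n'
    'blahblahblah\n'
)

print(two_space_hrefs(sample))  # ['THIS IS WHAT I WANT']
```

This works on the source text rather than on a parsed tree, so it is sensitive to details such as quoting style and letter case. If the real pages vary in those respects, the pattern would need to be adjusted.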
