I am trying to use regex to parse a site for blahblahblah <a href=THIS

Asked: June 18, 2026 (2026-06-18T21:46:24+00:00)

I am trying to use regex to parse a site for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(There are many of these, and I want all of them in some tokenized form.) The problem is that the “a  href” I want actually has TWO spaces, not just one. There are also some written as “a href” with one space, and I do NOT want to retrieve those. Because of this, using LXML has proven to be quite a pain, and I do not want to use BeautifulSoup (for other reasons). Does anyone know how I might go about doing this?

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-06-18T21:46:25+00:00

I believe this answers your question. It uses a couple of regular expressions to get all of the href values that come exactly two spaces after the opening '<a' of a tag.

import re

with open("index.html", "r") as fh:
    raw_string = fh.read()  # read the entire file into one string

# match only anchors with exactly two spaces between 'a' and 'href'
matches = re.findall(r'<a  href=".*?"', raw_string)
if matches:
    # drop the leading '<a  href=' and keep the quoted value
    matches = [re.search(r'".*?"', m).group(0) for m in matches]
    print(matches)
else:
    print("Not found")

For your example the output is:

['"THIS IS WHAT I WANT"']
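
If the surrounding quotation marks are not wanted, a capture group can return just the attribute value in a single pass. This is only a minimal sketch, assuming the two-space form is always written with literal spaces and the values are always double-quoted. The names `TWO_SPACE_HREF`, `extract_hrefs` and `sample` are made up for illustration.

```python
import re

# Exactly two spaces between "<a" and "href"; the group captures the value only.
# (Hypothetical name, not from the original answer.)
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')


def extract_hrefs(html):
    """Return href values from anchors written as '<a  href=' (two spaces)."""
    return TWO_SPACE_HREF.findall(html)


# Sample input modelled on the question, plus a one-space anchor to be skipped.
sample = (
    'blahblahblah\n'
    '<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>\n'
    '<a href="one space, skipped">x</a>\n'
    'blahblahblah'
)
print(extract_hrefs(sample))
```

With a single capture group, `re.findall` returns the captured text rather than the whole match, so no second regular expression is needed to strip the `<a  href=` prefix.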
