Question

Asked: May 10, 2026 (2026-05-10T19:57:33+00:00)

Given an HTML link like

<a href='urltxt' class='someclass' close='true'>texttxt</a>

how can I isolate the url and the text?

Updates

I’m using Beautiful Soup, and am unable to figure out how to do that.

I did

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print 'link content:', link.content, ' and attr:', link.attrs

I get

link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]
...

Why am I missing the content?

edit: elaborated on ‘stuck’ as advised 🙂

1 Answer

score 0 · Answer 1 · 2026-05-10T19:57:34+00:00

Use Beautiful Soup. Parsing HTML yourself is harder than it looks; you’ll be better off using a tried and tested module.

EDIT:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it’s a bad idea to open the URL inline like that: if the request fails, the error surfaces in the middle of that one expression and is harder to handle cleanly.
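One way to keep the fetch separate from the parse is sketched below. It assumes Python 3, where `urlopen` lives in `urllib.request` rather than `urllib`. The names `fetch_html` and `opener` are hypothetical, chosen for illustration; they are not part of the original answer or of Beautiful Soup.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    `opener` is a hypothetical hook so the function can be tried
    without touching the network; it defaults to urllib's urlopen.
    """
    try:
        response = opener(url)
    except (URLError, ValueError) as exc:
        # URLError covers network and HTTP failures;
        # ValueError covers a malformed URL string.
        print("could not open %s: %s" % (url, exc))
        return None
    try:
        return response.read()
    finally:
        response.close()
```

The result can then be checked before it is handed to the parser, for example `source = fetch_html(url)` followed by `if source is not None: ...`.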

EDIT 2:

This should show you all the links in a page:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = 'http://www.example.com/index.html'
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)

# findAll returns every <a> tag in the document
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # This <a> tag has no href attribute
        pass
    else:
        print link
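As for the original question of isolating both the URL and the text: in Beautiful Soup the text of a tag is normally read from `link.string` (or `link.contents` for the list of children), and `link.content` is not one of those attributes, which likely explains the `None`. For comparison, here is a minimal dependency-free sketch of the same idea. It does not use Beautiful Soup; it swaps in Python 3's standard-library `html.parser`. The names `LinkCollector` and `extract_links` are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open
        self._text = []        # text fragments seen inside it
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")  # None if no href
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def extract_links(html):
    """Return a list of (href, text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = "<a href='urltxt' class='someclass' close='true'>texttxt</a>"
print(extract_links(sample))  # [('urltxt', 'texttxt')]
```

Text inside nested tags (such as `<b>` within the link) is concatenated, and a link without an `href` is reported with `None` as its URL.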
