I’m a newbie to Python and programming in general so please excuse me if the question is very dumb.
I’ve been following this tutorial on RSS scraping step by step but I am getting a “list index out of range” error from Python when trying to gather the corresponding links to the titles of the articles being gathered.
Here is my code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
source = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')
find_title = re.findall(title, source)
find_link = re.findall(link, source)
literate = []
literate[:] = range(1, 16)
for i in literate:
    print find_title[i]
    print find_link[i]
It executes fine when I only tell it to retrieve titles, but immediately throws an index error when I would like to retrieve titles and their corresponding links.
Any assistance will be greatly appreciated.
I think you are using the wrong regex to extract the links from your page.
Take a look at the HTML source of your page: you will find that the links are not enclosed in a <link></link> pattern. The actual pattern is
<link rel="alternate" type="text/html" href= followed by the link. That's why your regex is not working: it matches few or no links, so find_link[i] runs past the end of the list and raises the index error.
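Here is a rough sketch of how the regex could be adapted. It is not tested against the real feed: sample_feed is a made-up snippet that stands in for the downloaded source, and the attribute order and double quotes around href are assumptions based on the pattern above. It uses Python 3 print syntax, so in Python 2 write print title instead of print(title). It also pairs the titles and links with zip, which stops at the shorter list instead of raising an index error.

```python
import re

# Hypothetical feed snippet standing in for the downloaded source;
# the real feed may order or quote its attributes differently.
sample_feed = '''<feed>
<title>Example Feed</title>
<link rel="alternate" type="text/html" href="http://example.com/" />
<entry>
<title>First article</title>
<link rel="alternate" type="text/html" href="http://example.com/first" />
</entry>
<entry>
<title>Second article</title>
<link rel="alternate" type="text/html" href="http://example.com/second" />
</entry>
</feed>'''

# Non-greedy match so each title is captured separately.
title_re = re.compile(r'<title>(.*?)</title>')
# Capture the href value instead of looking for <link>...</link>.
link_re = re.compile(r'<link\s+rel="alternate"\s+type="text/html"\s+href="([^"]*)"')

titles = title_re.findall(sample_feed)
links = link_re.findall(sample_feed)

# The original pattern finds nothing in this snippet, which is why
# indexing into its result fails.
old_links = re.findall(r'<link>(.*)</link>', sample_feed)

# zip stops at the shorter list, so a mismatch cannot raise IndexError.
pairs = list(zip(titles, links))
for title, link in pairs:
    print(title)
    print(link)
```

If you only want the first 15 items, slice the result (pairs[:15]) rather than looping over a fixed range of indexes.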