I’m trying to write a Python program that will grab and display any RSS updates since the last time the program was run. I am using feedparser and trying to use ETags and Last-Modified as described here on SO, but my test script does not seem to be working.
    import feedparser

    rsslist=["http://skottieyoung.tumblr.com/rss","http://mrjakeparker.com/feed/"]

    for feed in rsslist:
        print('--------'+feed+'-------')
        d=feedparser.parse(feed)
        print(len(d.entries))
        if (len(d.entries) > 0):
            etag=d.feed.get('etag','')
            modified=d.get('modified',d.get('updated',d.entries[0].get('published','no modified,update or published fields present in rss')))
            d2=feedparser.parse(feed,modified)
            if (len(d2.entries) > 0):
                etag2=d2.feed.get('etag','')
                modified2=d2.get('updated',d.entries[0].get('published',''))
            if (d2==d): #ideally we would never see this bc etags/last modified would prevent unnecessarily downloading what we all ready have.
                print("Arrg these are the same")
I’m honestly not sure whether RSS/XML technology has changed since the references I’ve been using online were written, or whether there is a problem with my code.
Regardless, I’m looking for the best way to use RSS feeds efficiently. As it stands, I want to minimize wasted bandwidth, which is what the Last-Modified and ETag fields are intended for.
Thanks in advance.
Your issue is that you are passing in the last modified date in place of the etag. The etag is the second argument to the parse() method; modified is the third argument.

Instead of:

    d2=feedparser.parse(feed,modified)

Do:

    d2=feedparser.parse(feed,modified=modified)
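To make the argument order harder to get wrong, both values can be passed by keyword and the response status checked for 304. The following is only a sketch: fetch_if_changed and the cache dict are hypothetical names, not part of feedparser, and the parse parameter exists only so the helper can be tried with a stub instead of a real network call. It assumes the result exposes etag, modified and status the way feedparser's documentation on HTTP features describes.

```python
def fetch_if_changed(url, cache, parse=None):
    """Fetch url, sending any cached ETag / Last-Modified values.

    cache is a plain dict (hypothetical state) of url -> {"etag", "modified"}.
    Returns the parsed result, or None if the server answered 304 Not Modified.
    """
    if parse is None:
        import feedparser  # imported lazily so a stub parser can be injected
        parse = feedparser.parse

    saved = cache.get(url, {})
    # Keyword arguments make it impossible to swap etag and modified.
    result = parse(url, etag=saved.get("etag"), modified=saved.get("modified"))

    # status is only present when the feed was fetched over HTTP.
    if result.get("status") == 304:
        return None  # nothing new; the server sent no feed body

    # Remember the validators for the next request.
    cache[url] = {"etag": result.get("etag"), "modified": result.get("modified")}
    return result
```

To survive between runs, the cache dict would have to be saved somewhere (for example as a JSON file) and loaded again at startup. Servers that ignore these headers will still return the full feed with a 200 status.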
After taking a look at the source code, it looks like the only thing that passing etag or modified to the parse() function does is send the appropriate headers to the server, so that the server can return an empty response if nothing has changed. If the server does not support this, then it will just return the full RSS feed. I would modify your code to check the date of each entry and ignore any entry with a date that is earlier than the max date in the previous request.
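The date check described above can be sketched roughly as follows. The helper names newer_than and latest_stamp are made up for illustration, and the sketch assumes each entry carries a published_parsed value (the parsed time tuple that feedparser attaches to entries when it can read the date). Entries whose date equals the previous maximum are also skipped, since they were already seen.

```python
def newer_than(entries, last_seen):
    """Return the entries published strictly after last_seen.

    last_seen is a time tuple, or None on the first run.
    Entries without a parsed date are kept, since they cannot be ruled out.
    """
    fresh = []
    for entry in entries:
        stamp = entry.get("published_parsed")
        if last_seen is None or stamp is None or stamp > last_seen:
            fresh.append(entry)
    return fresh


def latest_stamp(entries, previous=None):
    """Return the newest published_parsed among entries and previous."""
    stamps = [entry.get("published_parsed") for entry in entries]
    stamps = [stamp for stamp in stamps if stamp is not None]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None
```

On each run, newer_than(d.entries, last_seen) would give the entries to display, and latest_stamp(d.entries, last_seen) the value to store for the next run. This filters on the client only, so it does not save bandwidth when the server ignores the conditional headers.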