Hi,
I am having problems parsing an RSS feed from Stack Exchange in Python.
When I try to get the summary nodes, an empty list is returned.
I have been trying to solve this, but can’t get my head around it.
Can anyone help out?
Thanks
a
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Take a look at these two versions
As you discovered, the second version returns no nodes, but the lxml.html version works fine. The etree version is not working because it is namespace-aware: the feed's elements live in a namespace, so an unqualified path such as feed/entry/summary matches nothing. The html version is working because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says “The HTML parser notably ignores namespaces and some other XMLisms.”

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it’s a feed element with a namespace of http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.
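
A minimal sketch of that kind of fix, assuming a small inline Atom sample in place of the live feed so it runs without network access. The sample entries, the variable names and the "atom" prefix are made up for illustration; only the namespace URI comes from the root element shown above.

```python
# Sketch: query Atom elements by mapping a prefix to the feed's namespace.
# Falls back to the standard library's ElementTree if lxml is not installed;
# findall() with a namespaces mapping behaves the same way in both.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the downloaded feed (not real feed content).
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First question</title>
    <summary>How long should I boil an egg?</summary>
  </entry>
  <entry>
    <title>Second question</title>
    <summary>Can I freeze fresh basil?</summary>
  </entry>
</feed>"""

# The prefix name "atom" is arbitrary; only the URI has to match the feed.
NS = {"atom": "http://www.w3.org/2005/Atom"}

# The root element is the namespaced <feed> element.
root = etree.fromstring(SAMPLE_FEED)

# Unqualified names look for elements in no namespace, so nothing matches.
without_ns = root.findall("entry/summary")

# Prefixed names resolve through NS and match the namespaced elements.
with_ns = root.findall("atom:entry/atom:summary", namespaces=NS)
summaries = [node.text for node in with_ns]

print(without_ns)
print(summaries)
```

With lxml, the same prefix mapping can also be passed to xpath(), for example cooking_parsed.xpath('//atom:feed/atom:entry/atom:summary', namespaces=NS), which keeps the query close to the one in the question.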