I am trying to parse the XML returned by search engine APIs (Bing, Yahoo & Blekko). The XML returned by Blekko (for the sample search query ‘sushi’) takes the form:
<rss version="2.0">
<channel>
<title>blekko | rss for "sushi/rss /ps=100"</title>
<link>http://blekko.com/?q=sushi%2Frss+%2Fps%3D100</link>
<description>Blekko search for "sushi/rss /ps=100"</description>
<language>en-us</language>
<copyright>Copyright 2011 Blekko, Inc.</copyright>
<docs>http://cyber.law.harvard.edu/rss/rss.html</docs>
<webMaster>webmaster@blekko.com</webMaster>
<rescount>3M</rescount>
<item>
<title>Sushi - Wikipedia</title>
<link>http://en.wikipedia.org/wiki/Sushi</link>
<guid>http://en.wikipedia.org/wiki/Sushi</guid>
<description>Article about sushi, a food made of vinegared rice combined with various toppings or fillings. Sushi ( すし、寿司, 鮨, 鮓, 寿斗, 寿し, 壽司.</description>
</item>
</channel>
</rss>
The section of Python code to extract the required search result data is:
for counter in range(100):
    try:
        for item in BlekkoSearchResultsXML.getElementsByTagName('item'):
            Blekko_PageTitle = item.getElementsByTagName('title')[counter].toxml(encoding="utf-8")
            Blekko_PageDesc = item.getElementsByTagName('description')[counter].toxml(encoding="utf-8")
            Blekko_DisplayURL = item.getElementsByTagName('guid')[counter].toxml(encoding="utf-8")
            Blekko_URL = item.getElementsByTagName('link')[counter].toxml(encoding="utf-8")
            print "<h2>" + Blekko_PageTitle + "</h2><br />"
            print Blekko_PageDesc + "<br />"
            print Blekko_DisplayURL + "<br />"
            print Blekko_URL + "<br />"
    except IndexError:
        break
The code will not extract the Page Title of each search result returned, but does extract the rest of the info.
Furthermore, if I do not have the code:
print "<title>Page title to appear on browser tab</title>"
somewhere in the script, the title from the first search result is taken as the page title (i.e. the page appears with the title ‘Sushi – Wikipedia’ in the browser). If I do have a page title, the code still does not extract the page title from the search result.
The same code (with different tag names etc.) has the same problem with the Yahoo search API, but works fine with the Bing search API.
I guess that the .toxml() method returns the XML for the element, including its delimiting tags, so you’re getting something like this:
<h2><title>Sushi - Wikipedia</title></h2><br />
The title element is therefore interpreted as the page’s title, unless you specify your own in advance. Other elements are unknown to the browser, and it just displays their content as is.
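If that guess is right, one way around it is to print the text inside each element rather than the element's own markup. The following is only a minimal sketch, written for Python 3 with the standard xml.dom.minidom module. The names SAMPLE, child_text and render_results are made up for illustration, and the sample feed is a shortened stand-in for the real Blekko response.

```python
from html import escape
from xml.dom import minidom

# Shortened stand-in for the Blekko RSS response (assumption, not real API output).
SAMPLE = """<rss version="2.0"><channel>
<title>blekko | rss for sushi</title>
<item>
<title>Sushi - Wikipedia</title>
<link>http://en.wikipedia.org/wiki/Sushi</link>
<guid>http://en.wikipedia.org/wiki/Sushi</guid>
<description>Article about sushi.</description>
</item>
</channel></rss>"""


def child_text(item, tag):
    """Return the text inside the first <tag> element under item, or ''."""
    nodes = item.getElementsByTagName(tag)
    if not nodes:
        return ""
    # Join only text and CDATA children, so no surrounding tags are included.
    return "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def render_results(xml_text):
    """Build the HTML lines for every <item> in the feed."""
    doc = minidom.parseString(xml_text)
    lines = []
    for item in doc.getElementsByTagName("item"):
        # escape() keeps characters such as & and < from being read as HTML.
        lines.append("<h2>" + escape(child_text(item, "title")) + "</h2><br />")
        lines.append(escape(child_text(item, "description")) + "<br />")
        lines.append(escape(child_text(item, "guid")) + "<br />")
        lines.append(escape(child_text(item, "link")) + "<br />")
    return lines


if __name__ == "__main__":
    print("\n".join(render_results(SAMPLE)))
```

Because each item is handled on its own, the outer counter loop and the IndexError handling are not needed in this version. The same idea should carry over to the Yahoo results, assuming they also use a title element for each result.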