I’ve been banging my head against this for ages; I must be doing something stupid.
I am trying to retrieve all of the languages Wikipedia supports and write them to a text file by traversing the tables on List_of_Wikipedias.
Here is my Python code so far, which simply tries to retrieve one of the tables:
import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()
On my machine this only prints an empty list. To increase speed I cached the page locally and used:
wikipage = open("wikipage.html")
root = etree.parse(wikipage)
but this made no difference (other than the obvious speedup). I have also tried
root.find('table')
and:
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))
which successfully prints out all of the elements, so I know the tree is being created.
What am I doing wrong?
Any help would be appreciated.
Thanks.
Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath and has numerous good answers under the SO xpath tag. Just search for them.
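To illustrate the point about default namespaces, here is a minimal sketch. It assumes a tiny inline XHTML snippet (a hypothetical stand-in for the real page) and uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies. The names XHTML, NS, unprefixed, tables and cells are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page: a tiny XHTML document whose
# elements all live in the default XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="langs">
      <tr><td>English</td></tr>
      <tr><td>Deutsch</td></tr>
    </table>
  </body>
</html>"""

# Bind the XHTML namespace to the prefix "x" (the prefix name is arbitrary).
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# An unprefixed name only matches elements in no namespace,
# so this comes back empty, just like '//table' in the question.
unprefixed = root.findall(".//table")

# With the prefix bound, the same path matches the namespaced elements.
tables = root.findall(".//x:table", NS)
cells = [td.text for td in root.findall(".//x:table/x:tr/x:td", NS)]

print(len(unprefixed), len(tables), cells)
```

With lxml, the same prefix mapping can be passed to the xpath() method, for example root.xpath('//x:table', namespaces=NS).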
Here is a complete solution:
Use:
where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") bound to the prefix "x".
When I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html, I needed to add the following at the start of the document, because I was working locally and didn’t have the DTD for XHTML (you may not need these):
The result of applying the above XPath expression to this document is:
Do note: Every second selected node is a white-space-only text node. If you don’t want these selected, use: