I’ve been bashing my head at this for ages, I must be doing something

Question

Asked: May 19, 2026 (2026-05-19T11:06:42+00:00)

I’ve been bashing my head against this for ages; I must be doing something stupid.

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias.

Here is my Python code so far, which simply tries to retrieve one of the tables:

import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()

On my machine this only prints an empty list. To increase speed I cached the page locally and used:

wikipage = open("wikipage.html")
root = lxml.parse(wikipage)

but this makes no difference whatsoever (other than the obvious speedup). I have also tried

lxml.find('table')

and:

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

which successfully prints out all of the elements, so I know the tree is being created.

What am I doing wrong?

Any help would be appreciated.
Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-05-19T11:06:43+00:00

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias

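Your problem is that the element names in the document are in a default namespace. In XPath 1.0 an unprefixed name such as table matches only elements that are in no namespace, so //table selects nothing. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and it has numerous good answers under the SO xpath tag. Just search for them.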
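To see the effect in isolation, here is a minimal sketch using lxml on a small hand-written XHTML fragment. The fragment and the variable names are illustrative assumptions, not the real page. The unprefixed query comes back empty, while the same query with a prefix bound to the XHTML namespace finds the table.

```python
from lxml import etree

# Hypothetical stand-in for the real page: the root element declares the
# XHTML namespace as the default, so every descendant inherits it.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <body><table><tr><td>English</td></tr></table></body>
</html>"""

root = etree.fromstring(SAMPLE)

# Unprefixed names match only elements that are in no namespace.
plain = root.xpath('//table')

# Bind a prefix (any name will do) to the namespace URI and use it in the path.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
prefixed = root.xpath('//x:table', namespaces=ns)

print(len(plain), len(prefixed))
# lxml reports namespaced tags in Clark notation: {uri}localname
print(root.tag)
```

The Clark-notation tag is also what a loop over root.iter() prints for each element, which is a quick way to check whether a document uses a default namespace.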

Here is a complete solution. Use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()

where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") bound to the prefix "x".
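In lxml, "registering" the namespace amounts to passing a prefix-to-URI mapping through the namespaces argument of xpath(). The following is a sketch only: the sample markup is a heavily trimmed, hypothetical stand-in for the statistics table, and the names XHTML_NS, SAMPLE and EXPR are made up for the illustration.

```python
from lxml import etree

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical, heavily trimmed stand-in for the statistics table.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
  <tr><th>No</th><th>Language</th></tr>
  <tr><td>1</td><td><a href="#">English</a></td></tr>
  <tr><td>2</td><td><a href="#">German</a></td></tr>
</table>
</body></html>"""

root = etree.fromstring(SAMPLE)

# First table, rows without header cells, second cell, all text beneath it.
EXPR = '(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()'

# The namespaces mapping binds the prefix x for this one evaluation.
languages = root.xpath(EXPR, namespaces={'x': XHTML_NS})

print(languages)
```

For a locally cached copy, etree.parse("wikipage.html") followed by the same xpath() call should behave the same way, assuming the file is well-formed XHTML.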

I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html

Because I was working locally and didn’t have the DTD for XHTML, I needed to add the following at the start of the document. You may not need these:

<!DOCTYPE html [
<!ENTITY uarr "&#8593;">
<!ENTITY darr "&#8595;">
<!ENTITY ccedil "&#231;">
<!ENTITY oslash "&#248;">
<!ENTITY aacute "&#225;">
<!ENTITY aring "&#229;">
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
<!ENTITY ograve "&#242;">
<!ENTITY ocirc "&#244;">
]>
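As a rough illustration of why such declarations matter, the sketch below parses a tiny hypothetical document with lxml twice: once with an internal DTD subset that declares the entity it uses, and once without. The document content and variable names are assumptions for the example.

```python
from lxml import etree

NS = {'x': 'http://www.w3.org/1999/xhtml'}

# Hypothetical document: the internal DTD subset declares the one entity used.
DOC = b"""<!DOCTYPE html [
<!ENTITY ccedil "&#231;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Franco-Proven&ccedil;al</p></body>
</html>"""

root = etree.fromstring(DOC)
# The entity reference is expanded to its replacement text while parsing.
text = root.xpath('string(//x:p)', namespaces=NS)
# ascii() keeps the printed form readable on any console encoding.
print(ascii(text))

# Without a declaration the same reference is a well-formedness error.
failed_without_dtd = False
try:
    etree.fromstring(b'<p>Franco-Proven&ccedil;al</p>')
except etree.XMLSyntaxError as exc:
    failed_without_dtd = True
    print('without the declaration:', exc)
```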

The result of applying the above XPath expression to this document is:

                    English

                    German

                    French

                    Polish

                    Italian

                    Japanese

                    Spanish

                    Portuguese

                    Dutch

                    Russian

                    Swedish

                    Chinese

                    Catalan

                    Norwegian (Bokmål)

                    Finnish

                    Ukrainian

                    Czech

                    Hungarian

                    Romanian

                    Korean

                    Turkish

                    Vietnamese

                    Indonesian

                    Danish

                    Arabic

                    Esperanto

                    Serbian

                    Lithuanian

                    Slovak

                    Volapük

                    Persian

                    Hebrew

                    Bulgarian

                    Slovenian

                    Malay

                    Waray-Waray

                    Croatian

                    Estonian

                    Newar / Nepal Bhasa

                    Simple English

                    Hindi

                    Galician

                    Thai

                    Basque

                    Norwegian (Nynorsk)

                    Aromanian

                    Greek

                    Haitian

                    Azerbaijani

                    Tagalog

                    Latin

                    Telugu

                    Georgian

                    Macedonian

                    Cebuano

                    Serbo-Croatian

                    Breton

                    Piedmontese

                    Marathi

                    Latvian

                    Luxembourgish

                    Javanese

                    Belarusian (Taraškievica)

                    Welsh

                    Icelandic

                    Bosnian

                    Albanian

                    Tamil

                    Belarusian

                    Bishnupriya Manipuri

                    Aragonese

                    Occitan

                    Bengali

                    Swahili

                    Ido

                    Lombard

                    West Frisian

                    Gujarati

                    Afrikaans

                    Low Saxon

                    Malayalam

                    Quechua

                    Sicilian

                    Urdu

                    Kurdish

                    Cantonese

                    Sundanese

                    Asturian

                    Neapolitan

                    Samogitian

                    Armenian

                    Yoruba

                    Irish

                    Chuvash

                    Walloon

                    Nepali

                    Ripuarian

                    Western Panjabi

                    Kannada

                    Tajik

                    Tarantino

                    Venetian

                    Yiddish

                    Scottish Gaelic

                    Tatar

                    Min Nan

                    Ossetian

                    Uzbek

                    Alemannic

                    Kapampangan

                    Sakha

                    Egyptian Arabic

                    Kazakh

                    Maori

                    Limburgian

                    Amharic

                    Nahuatl

                    Upper Sorbian

                    Gilaki

                    Corsican

                    Gan

                    Mongolian

                    Scots

                    Interlingua

                    Central_Bicolano

                    Burmese

                    Faroese

                    Võro

                    Dutch Low Saxon

                    Sinhalese

                    Turkmen

                    West Flemish

                    Sanskrit

                    Bavarian

                    Malagasy

                    Manx

                    Ilokano

                    Divehi

                    Norman

                    Pangasinan

                    Banyumasan

                    Sorani

                    Romansh

                    Northern Sami

                    Zazaki

                    Mazandarani

                    Wu

                    Friulian

                    Uyghur

                    Ligurian

                    Maltese

                    Bihari

                    Novial

                    Tibetan

                    Anglo-Saxon

                    Kashubian

                    Sardinian

                    Classical Chinese

                    Fiji Hindi

                    Khmer

                    Ladino

                    Zamboanga Chavacano

                    Pali

                    Franco-Provençal/Arpitan

                    Pashto

                    Hakka

                    Cornish

                    Punjabi

                    Navajo

                    Silesian

                    Kalmyk

                    Pennsylvania German

                    Hawaiian

                    Saterland Frisian

                    Interlingue

                    Somali

                    Komi

                    Karachay-Balkar

                    Crimean Tatar

                    Tongan

                    Acehnese

                    Meadow Mari

                    Picard

                    Erzya

                    Lingala

                    Kinyarwanda

                    Extremaduran

                    Guarani

                    Kirghiz

                    Emilian-Romagnol

                    Assyrian Neo-Aramaic

                    Papiamentu

                    Aymara

                    Chechen

                    Lojban

                    Wolof

                    Banjar

                    Bashkir

                    North Frisian

                    Greenlandic

                    Tok Pisin

                    Udmurt

                    Kabyle

                    Tahitian

                    Sranan

                    Zealandic

                    Hill Mari

                    Komi-Permyak

                    Lower Sorbian

                    Abkhazian

                    Gagauz

                    Igbo

                    Oriya

                    Lao

                    Kongo

                    Avar

                    Moksha

                    Mirandese

                    Romani

                    Old Church Slavonic

                    Karakalpak

                    Samoan

                    Moldovan

                    Tetum

                    Gothic

                    Kashmiri

                    Bambara

                    Inupiak

                    Sindhi

                    Bislama

                    Lak

                    Nauruan

                    Norfolk

                    Inuktitut

                    Pontic

                    Assamese

                    Cherokee

                    Min Dong

                    Swati

                    Palatinate German

                    Hausa

                    Ewe

                    Tigrinya

                    Oromo

                    Zulu

                    Zhuang

                    Venda

                    Tsonga

                    Kirundi

                    Dzongkha

                    Sango

                    Cree

                    Chamorro

                    Luganda

                    Buginese

                    Buryat (Russia)

                    Fijian

                    Chichewa

                    Akan

                    Sesotho

                    Xhosa

                    Fula

                    Tswana

                    Kikuyu

                    Tumbuka

                    Shona

                    Twi

                    Cheyenne

                    Ndonga

                    Sichuan Yi

                    Choctaw

                    Marshallese

                    Afar

                    Kuanyama

                    Hiri Motu

                    Muscogee

                    Kanuri

                    Herero

Note that every second selected node is a whitespace-only text node. If you don’t want these selected, use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
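The difference between the two expressions can be sketched with lxml as follows. The fragment is hypothetical and only mimics the indentation around the link inside the cell; BASE, all_nodes and non_blank are names invented for the example.

```python
from lxml import etree

NS = {'x': 'http://www.w3.org/1999/xhtml'}

# Hypothetical fragment: the cell has indentation around the link, which
# the parser keeps as whitespace-only text nodes.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
  <tr><th>No</th><th>Language</th></tr>
  <tr><td>1</td><td>
      <a href="#">English</a>
  </td></tr>
</table>
</body></html>"""

root = etree.fromstring(SAMPLE)
BASE = '(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()'

# Every text node under the cell, including the surrounding whitespace.
all_nodes = root.xpath(BASE, namespaces=NS)

# normalize-space() yields an empty string for whitespace-only nodes, and an
# empty string is false in a predicate, so those nodes are filtered out.
non_blank = root.xpath(BASE + '[normalize-space()]', namespaces=NS)

print(len(all_nodes), non_blank)
```

The filtered result is a plain list of strings, so writing it to a text file with one language per line needs nothing more than "\n".join(non_blank).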
