I’m looking to use Python and xml.dom.minidom to get a list of links within

Asked: May 13, 2026 (2026-05-13T10:41:58+00:00)

I’m looking to use Python and xml.dom.minidom to get a list of links within a particular <table> specified by the table id. Based on some excellent advice, I’m trying to use the DOM instead of pattern matching.

import urllib.request
import xml.dom.minidom

url = 'http://www.batstrading.com/market_data/shortsales'
page = xml.dom.minidom.parse(urllib.request.urlopen(url))

I can get all the links by tag name with page.getElementsByTagName('a'), but I cannot limit the result to only those links contained within the table with ID “monthly-short-sale”.
Using getElementById returns None.

Is this because the “monthly-short-sale” ID is not defined within the DTD? If so, what would be the best way to extract this information?

Here is the code that I’m currently using, which works, but sins against god:

import datetime
import urllib.request
import xml.dom.minidom

url = 'http://www.batstrading.com/market_data/shortsales'

def getDownloadLink(alink, prefix='BATSsh'):
    """return (datetime.date, link) for the provided link if the link
    target starts with the data file prefix"""

    n = len(prefix)
    href = alink.getAttribute('href')
    # data file names are expected to be exactly 25 characters long
    if href.startswith(prefix) and (len(href) == 25):
        # the eight characters after the prefix encode the date as YYYYMMDD
        year = int(href[n:n+4])
        month = int(href[n+4:n+6])
        day = int(href[n+6:n+8])
        date = datetime.date(year, month, day)
        return (date, url + '/' + href)
    # every other link falls through and returns None

page = xml.dom.minidom.parse(urllib.request.urlopen(url))
# scan every <a> on the page, then keep only the data-file links as date -> URL
link = (getDownloadLink(a) for a in page.getElementsByTagName('a'))
link = dict(i for i in link if i is not None)

1 Answer

Editorial Team · Answer 1 · 2026-05-13T10:41:59+00:00

The problem is that minidom is a non-external-entity-reading XML parser. That means it doesn’t even look at the DTD, so it doesn’t know that in HTML the attribute with the name id corresponds to an ID schema type.
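To see the behaviour described above in isolation, here is a small sketch. The SAMPLE markup is a made-up stand-in for the real page, which may look quite different.

```python
import xml.dom.minidom

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE = (
    '<html><body>'
    '<table id="monthly-short-sale">'
    '<tr><td><a href="first.txt">first</a></td></tr>'
    '</table>'
    '</body></html>'
)

doc = xml.dom.minidom.parseString(SAMPLE)

# No DTD was read, so minidom treats "id" as an ordinary attribute
# and the ID lookup finds nothing.
lookup = doc.getElementById('monthly-short-sale')

# The attribute is still present and readable as plain text.
table = doc.getElementsByTagName('table')[0]
table_id = table.getAttribute('id')
```

Here lookup ends up as None even though the attribute value is right there in table_id.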

A further consequence of this is that minidom won’t know about the HTML-specific entities like &eacute; that are defined in the XHTML doctype, so you may lose text that way.

If you don’t care about this, you can continue using minidom and use an alternative way to get at the table, involving getElementsByTagName and checking element.getAttribute('id') manually. (You could hack up your own getElementById function to do it the slow way.)
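A minimal sketch of that slow, hand-rolled lookup might look like the following. The helper name get_element_by_id, the SAMPLE markup and the file names in it are all made up for illustration; only the table ID comes from the question.

```python
import xml.dom.minidom

def get_element_by_id(node, wanted_id):
    """Return the first element below node whose id attribute equals
    wanted_id, or None. This is a linear scan over every element."""
    for element in node.getElementsByTagName('*'):
        if element.getAttribute('id') == wanted_id:
            return element
    return None

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE = (
    '<html><body>'
    '<a href="elsewhere.html">outside the table</a>'
    '<table id="monthly-short-sale">'
    '<tr><td><a href="first.txt">first</a></td></tr>'
    '<tr><td><a href="second.txt">second</a></td></tr>'
    '</table>'
    '</body></html>'
)

doc = xml.dom.minidom.parseString(SAMPLE)
table = get_element_by_id(doc, 'monthly-short-sale')

# Searching from the table element only sees links nested inside it.
hrefs = [a.getAttribute('href') for a in table.getElementsByTagName('a')]
missing = get_element_by_id(doc, 'no-such-id')
```

With the sample above, hrefs holds only the two links inside the table, and the link outside it is skipped. Each call walks the whole tree, so this is fine for one lookup per page but wasteful if repeated many times.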

Or you could use an XML parser that does read external entities, such as pxdom. However, this means the parser will have to fetch and parse the DTD from the W3C each time, which will be unpleasantly slow.

Or you could go for an HTML parser, which has the HTML entities and ID-nesses built in, such as BeautifulSoup. This might be a better idea when you are dealing with real-world HTML pages served as text/html, which, though they may claim to be XHTML, often include naughty bits that aren’t well-formed.
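As a rough illustration of the HTML-parser route, here is a sketch that swaps BeautifulSoup for the standard library's html.parser, purely to avoid a third-party dependency. It is not the BeautifulSoup API. The class name TableLinkCollector and the deliberately sloppy SAMPLE markup are assumptions for the example.

```python
from html.parser import HTMLParser

class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0    # <table> nesting depth while inside the target table
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            if self.depth:
                self.depth += 1          # nested table inside the target
            elif attrs.get('id') == self.table_id:
                self.depth = 1           # entering the target table
        elif tag == 'a' and self.depth and attrs.get('href'):
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'table' and self.depth:
            self.depth -= 1

# Hypothetical, intentionally not well-formed: unquoted attributes,
# unclosed <br>, <tr> and <td>, and an HTML entity.
SAMPLE = """
<a href=outside.html>outside</a>
<table id="monthly-short-sale">
  <tr><td><a href="first.txt">caf&eacute;</a><br>
  <tr><td><a href=second.txt>two</a>
</table>
<a href="after.html">after</a>
"""

collector = TableLinkCollector('monthly-short-sale')
collector.feed(SAMPLE)
collector.close()
hrefs = collector.hrefs
```

An XML parser would reject this sample outright, while the HTML parser picks up the two links inside the table and ignores the ones before and after it.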
