What existing Python 2.x libraries can parse an XML document with an internal DTD without automatically expanding its entities? (The file in question, for those curious: JMdict.)
It seems lxml has an option for not resolving entities, but when I last tried it, the entities just ended up as blanks. A quick search also turned up pxdom as another alternative, which I may try, but since it’s pure Python I expect it to be far slower than I’d like.
Anything else out there?
It seems my use case is rather unusual: not expanding entities goes against the way parsers are generally supposed to work according to the XML spec.
So I think it’s easiest to just kludge this. I’ve manually extracted the entity declarations from the DTD via re.finditer and built a dictionary of the mappings. From here, it’s just a matter of scanning the parsed output and doing the right thing for my app. Good enough for my use case, I think.
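For anyone wanting to do the same, here is a minimal sketch of that kludge, not my exact code. The regex, the function names (extract_entities, invert, entity_codes) and the tiny sample document are all made up for illustration, and it assumes only plain internal entities of the form <!ENTITY name "text"> (no parameter or external entities). It uses the standard library's ElementTree, which expands the entities as usual, and then maps the expanded text back to the entity name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: matches internal general entity declarations such as
#   <!ENTITY n "some replacement text">
# Parameter entities (<!ENTITY % ...>) and SYSTEM/PUBLIC entities are skipped.
ENTITY_DECL = re.compile(r"""
    <!ENTITY \s+
    (?P<name> [^\s%"']+ ) \s+
    (?: "(?P<dq>[^"]*)" | '(?P<sq>[^']*)' )
    \s* >
""", re.VERBOSE)


def extract_entities(xml_text):
    """Return {entity name: replacement text} for each declaration found."""
    entities = {}
    for match in ENTITY_DECL.finditer(xml_text):
        value = match.group("dq")
        if value is None:
            value = match.group("sq")
        entities[match.group("name")] = value
    return entities


def invert(entities):
    """Return {replacement text: entity name}; assumes the texts are unique."""
    return dict((value, name) for name, value in entities.items())


def entity_codes(xml_text, tag):
    """Parse normally, then map each <tag> text back to its entity name.

    Text that is not exactly one entity's replacement text is left as-is.
    """
    by_text = invert(extract_entities(xml_text))
    root = ET.fromstring(xml_text)
    return [by_text.get(elem.text, elem.text) for elem in root.iter(tag)]


# Made-up sample in the general style of the file, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE JMdict [
<!ELEMENT JMdict (entry*)>
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY adj-i "adjective (keiyoushi)">
]>
<JMdict><entry><pos>&n;</pos><pos>&adj-i;</pos></entry></JMdict>
"""

if __name__ == "__main__":
    print(extract_entities(SAMPLE))
    print(entity_codes(SAMPLE, "pos"))
```

The reverse lookup only works when an element's entire text is a single entity's expansion and no two entities share the same replacement text, which is enough for my use case but not a general solution.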