I am currently running the following code based on Chapter 12.5 of the Python Cookbook:
from xml.parsers import expat class Element(object): def __init__(self, name, attributes): self.name = name self.attributes = attributes self.cdata = '' self.children = [] def addChild(self, element): self.children.append(element) def getAttribute(self,key): return self.attributes.get(key) def getData(self): return self.cdata def getElements(self, name=''): if name: return [c for c in self.children if c.name == name] else: return list(self.children) class Xml2Obj(object): def __init__(self): self.root = None self.nodeStack = [] def StartElement(self, name, attributes): element = Element(name.encode(), attributes) if self.nodeStack: parent = self.nodeStack[-1] parent.addChild(element) else: self.root = element self.nodeStack.append(element) def EndElement(self, name): self.nodeStack.pop() def CharacterData(self,data): if data.strip(): data = data.encode() element = self.nodeStack[-1] element.cdata += data def Parse(self, filename): Parser = expat.ParserCreate() Parser.StartElementHandler = self.StartElement Parser.EndElementHandler = self.EndElement Parser.CharacterDataHandler = self.CharacterData ParserStatus = Parser.Parse(open(filename).read(),1) return self.root
I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?
I looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the xml and deal with the events as they occur.
Note however, Fredriks advice on using cElementTree iterparse function:
The lxml.iterparse() does not allow this.
The previous does not work on Python 3.7, consider the following way to get the first element.