I’m making an xml parser to parse xml reports from different tools, and each tool generates different reports with different tags.
For example:
Arachni generates an xml report with <arachni_report></arachni_report> as tree root tag.
nmap generates an xml report with <nmaprun></nmaprun> as tree root tag.
I’m trying not to parse the entire file unless it’s a valid report from any of the tools I want.
First thing I thought to use was ElementTree, parse the entire xml file (supposing it contains valid xml), and then check based on the tree root if the report belongs to Arachni or nmap.
I’m currently using cElementTree, and as far as I know getroot() is not an option here, but my goal is to make this parser to operate with recognized files only, without parsing unnecessary files.
By the way, I’m Still learning about xml parsing, thanks in advance.
“simple string methods” are the root [pun intended] of all evil — see examples below.
Update 2 Code and output now show that proposed regexes also don’t work very well.
Use ElementTree. The function that you are looking for is
iterparse. Enable “start” events. Bale out on the first iteration.Code:
Above ElementTree-related code works with Python 2.5 to 2.7. Will work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. Should work with any lxml version.
Output:
Update 1 The above was demonstration code. Below is more like implementation code… just add exception handling. Tested with Python 2.7 and 2.2.