I have some XML that I want to parse using the lxml library in Python. Having parsed the elements, I want to be able to compare some structured objects (looking for differences between them).
There are about 50 XML files I need to parse, and the data in the XML is in an ‘uneven’ form (I’m not sure what the correct name is).
Simplified Example XML:
<ID 1>
<parameter A>
<parameter B>
</ID 1>
<ID 2>
<parameter A>
<parameter B>
<parameter C>
</ID 2>
<ID 3>
<parameter A>
</ID 3>
How would I go about creating a suitable DB (MySQL?) structure that I can use to isolate each object via its ID and compare each of the parameter elements?
I’m not sure if this makes sense – I’m not hugely au fait with the correct terminology.
The actual source xml is all the files listed here: http://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm
These files are versions of the same structure that have been updated over the past few years. I don’t need all the XML elements in the DB, just a subset, starting with a version number and release date, and then the individual IDs and byte patterns found in the two primary sections.
Pushing it into MySQL may not be the best way forward, but I figured that if I did, I could then use a Python/HTML front end to put together a search/comparison tool.
The key phrase for me in your question is: ‘I don’t need all the XML elements in the DB, just a subset’
Since you know up front all the elements of the subset that you want to compare, I suggest a single table with one column for each data element. This should make it easier to handle your later reporting requirement on the data.
The alternative, storing the elements row-wise (one row per ID/parameter pair), is generally considered an anti-pattern and will make the reporting and comparison significantly harder. That sort of strategy might only be required if you don’t know beforehand the type (or number) of elements to compare.
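To illustrate why the row-wise layout gets awkward, here is a rough sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `params`, its columns and the sample values are all made up for the example. Comparing just two IDs already needs a subquery and two joins:

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for the sketch).
conn = sqlite3.connect(":memory:")

# Row-wise layout: one row per (id, parameter name) pair. Names are hypothetical.
conn.execute("CREATE TABLE params (id INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO params VALUES (?, ?, ?)",
    [
        (1, "A", "aa"), (1, "B", "bb"),
        (2, "A", "aa"), (2, "B", "xx"), (2, "C", "cc"),
    ],
)

# To compare two IDs we first collect every parameter name either of them has,
# then join the table back onto that list once per ID.
# "IS NOT" is SQLite's NULL-safe inequality; MySQL would need NOT (a <=> b).
query = """
SELECT n.name, p1.value, p2.value
FROM (SELECT DISTINCT name FROM params WHERE id IN (?, ?)) AS n
LEFT JOIN params AS p1 ON p1.name = n.name AND p1.id = ?
LEFT JOIN params AS p2 ON p2.name = n.name AND p2.id = ?
WHERE p1.value IS NOT p2.value
ORDER BY n.name
"""
differences = conn.execute(query, (1, 2, 1, 2)).fetchall()

for name, left, right in differences:
    print(name, left, right)
```

With one column per parameter, the same comparison is a plain row-to-row check.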
EDIT: To be more explicit I was thinking the table would have columns:
ID, Parameter1, Parameter2, Parameter3, Parameter4
where ParameterX is one of the ‘comparable parameters’ you were looking at. For many rows a column might be left NULL because that ID has no such parameter.
Then there would be only one table overall, with one row in that table for each ID.
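A minimal sketch of that layout, with several assumptions: it uses the standard library's xml.etree.ElementTree (lxml.etree offers a largely compatible API) and sqlite3 in place of MySQL, and the XML below is a made-up, well-formed stand-in for the simplified example in the question. The element names `item`, `A`, `B`, `C`, the table name `items` and the helpers `load` and `diff` are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the simplified XML in the question.
SAMPLE = """
<root>
  <item id="1"><A>aa</A><B>bb</B></item>
  <item id="2"><A>aa</A><B>xx</B><C>cc</C></item>
  <item id="3"><A>aa</A></item>
</root>
""".strip()

# The subset of comparable parameters, known up front.
COLUMNS = ["A", "B", "C"]


def load(xml_text, conn):
    """Create the single wide table and insert one row per ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT)"
    )
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        # findtext returns None for a missing child, which is stored as NULL.
        values = [item.findtext(name) for name in COLUMNS]
        conn.execute(
            "INSERT INTO items (id, A, B, C) VALUES (?, ?, ?, ?)",
            [int(item.get("id")), *values],
        )


def diff(conn, id_a, id_b):
    """Return {column: (value_a, value_b)} for the columns that differ."""
    query = "SELECT A, B, C FROM items WHERE id = ?"
    row_a = conn.execute(query, (id_a,)).fetchone()
    row_b = conn.execute(query, (id_b,)).fetchone()
    return {
        name: (a, b)
        for name, a, b in zip(COLUMNS, row_a, row_b)
        if a != b
    }


conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
rows = conn.execute("SELECT id, A, B, C FROM items ORDER BY id").fetchall()
print(rows)
print(diff(conn, 1, 2))
```

Because every ID has the same columns, comparing two objects is just a column-by-column check of two rows, and a missing parameter simply shows up as NULL (None in Python).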