I’m able to get the value in the Image tag (see XML below), but not the Category tag. The difference is that one is a CDATA section and the other is just a string. Any help would be appreciated.
    from xml.dom import minidom

    xml = '''<?xml version='1.0' ?>
    <ProductData>
        <ITEM Id='0471195'>
            <Category>
                <![CDATA[Homogenizers]]>
            </Category>
            <Image>
                0471195.jpg
            </Image>
        </ITEM>
        <ITEM Id='0471195'>
            <Category>
                <![CDATA[Homogenizers]]>
            </Category>
            <Image>
                0471196.jpg
            </Image>
        </ITEM>
    </ProductData>
    '''

    bad_xml_item_count = 0
    data = {}
    xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
    parts = xml_data[0].getElementsByTagName('ITEM')
    for p in parts:
        try:
            part_id = p.attributes['Id'].value.strip()
        except KeyError:
            bad_xml_item_count += 1
            continue
        if not part_id:
            bad_xml_item_count += 1
            continue
        part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
        part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
        print '\t'.join([part_id, part_category, part_image])
minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it’s worth, but minidom is much older than DOM L3.)
So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
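To see that structure for yourself, here is a small sketch that lists the children of a Category element. The trimmed-down XML string and the variable names are made up for illustration; only the minidom calls come from the standard library.

```python
from xml.dom import minidom

# Cut-down version of the question's XML: one ITEM with whitespace around the CDATA
xml = "<ITEM Id='0471195'><Category>\n  <![CDATA[Homogenizers]]>\n</Category></ITEM>"
category = minidom.parseString(xml).getElementsByTagName('Category')[0]

# Record the node class and raw data of every direct child of <Category>
children = [(type(node).__name__, node.data) for node in category.childNodes]
for name, data in children:
    print('%s %r' % (name, data))
```

The first and last entries are whitespace-only Text nodes, which is why firstChild.data.strip() comes back empty in the question's code.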
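What you probably want is the textual data of all children of Category. In DOM Level 3 Core you’d just read the element’s textContent property, but minidom doesn’t support that yet.
Recent versions do, however, support another Level 3 property you can use to do the same thing in a more roundabout way: wholeText, which on a Text node returns its data joined with that of its adjacent Text and CDATASection siblings.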
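A minimal sketch of that approach, assuming your minidom exposes the wholeText property on Text nodes. The text_of helper is hypothetical, not part of minidom; it is a fallback that joins the data of the direct Text and CDATASection children by hand.

```python
from xml.dom import minidom, Node

xml = """<ProductData>
<ITEM Id='0471195'>
    <Category>
        <![CDATA[Homogenizers]]>
    </Category>
    <Image> 0471195.jpg </Image>
</ITEM>
</ProductData>"""


def text_of(element):
    """Hypothetical helper: join the data of direct Text/CDATASection children."""
    return ''.join(
        child.data for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


item = minidom.parseString(xml).getElementsByTagName('ITEM')[0]
category = item.getElementsByTagName('Category')[0]

# wholeText joins the first child with its adjacent Text/CDATASection siblings
via_whole_text = category.firstChild.wholeText.strip()

# Same result without relying on wholeText
via_helper = text_of(category).strip()

print(via_whole_text)
print(via_helper)
```

Both approaches only look at direct children, so unlike textContent they will not descend into nested elements. The wholeText version also assumes the element has at least one child; on an empty element firstChild is None.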