I’m able to get the value in the image tag (see XML below), but

Question


Asked: May 11, 2026


I’m able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = '''<?xml version='1.0' ?>
<ProductData>
    <ITEM Id='0471195'>
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id='0471195'>
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
'''

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    # This is the line that returns an empty string instead of the category
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))


1 Answer

score 0 · Answer 1 · 2026-05-11T08:54:37+00:00

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it’s worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
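To see those three siblings for yourself, here is a minimal sketch that parses a cut-down version of the XML from the question and lists the children of Category. The variable names are only illustrative; node type 3 is a Text node and 4 is a CDATASection node.

```python
from xml.dom import minidom

# A trimmed-down ITEM with the same whitespace-then-CDATA layout as the question.
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# Collect (nodeType, data) for each direct child of <Category>.
children = [(child.nodeType, child.data) for child in category.childNodes]
for node_type, text in children:
    print(node_type, repr(text))
```

The first and last entries are whitespace-only Text nodes, which is why calling strip() on firstChild.data gives an empty string.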

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you’d just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn’t support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
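Putting that together, a rough sketch of the loop from the question might look like the following. It assumes every ITEM has non-empty Category and Image elements; `element_text` is a hypothetical helper of mine (not part of minidom) that joins the Text and CDATASection children by hand, which may be useful if your minidom version lacks wholeText or if an element could be empty.

```python
from xml.dom import minidom, Node

def element_text(element):
    """Join the data of the element's direct Text and CDATASection children.

    Hypothetical helper: returns '' for an empty element instead of failing.
    """
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

xml = """<ProductData>
    <ITEM Id='0471195'>
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
</ProductData>"""

rows = []
for item in minidom.parseString(xml).getElementsByTagName("ITEM"):
    category = item.getElementsByTagName("Category")[0]
    image = item.getElementsByTagName("Image")[0]
    rows.append((
        item.getAttribute("Id").strip(),
        # wholeText spans the whitespace Text nodes and the CDATA section
        category.firstChild.wholeText.strip(),
        element_text(image).strip(),
    ))

for row in rows:
    print("\t".join(row))
```

Either approach still needs the strip() call, because the surrounding whitespace is part of the concatenated text.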
