I am trying to write an XML parser using BeautifulSoup4 in Python. For some reason, the document is not being parsed correctly: the input elements come back without their children. My XML document is shown below:
<module id="BrainParser_1" name="Brain Parser" package="CCB" version="1" location="pipeline://cranium.loni.ucla.edu//usr/local/loniWorkflows/BrainParser/brainparser.sh" sourceCode="" icon="/9j/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx
NDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy
MjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAAUCABIAFYEASIAAhEBAxEBBCIA/8QAHwAAAQUB
AQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEG
E1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLD
xMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAA
AAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKR
obHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hp
anN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU
1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADgQBAAIRAxEEAAA/APn+iip7OzuNQuktbWIyzPna
g74GaEr7AfP9FFTWlrcX13Fa2sLzTysEjjQZLE9AKgrQt9C1e7RHt9KvplcblaO3dgw9Rgc16B4a
+HEMRabWV8+YAGOBThM57nv9OnXrXqn9qXs0cYitoItoAYfxAVr7GS1krDguZ6ENTQWlzckC3t5Z
STgCNC3P4V7b4G/Z/urp/tfi5mtoBgpaQyDe/wDvEfdH05r6B03TLLSLGKy0+2jtraJdqRxrgAV8
9ReCPEstuJho9wiE4/e4jOfoxBqZ/AHihOulP0zgSxk4+gavdLi6vLxwFRTHG3zAnlvpVdXeS7Mi
mQHaRsJxjkVzc75mmd6wtNpO7Piyy8CeK9RjaS18Pai6qcEmBl5/HFTS/DnxlDGXfw3qO0ekJP6C
vtiivnm90vUNNYre2VxbnO3MsZUE+xPXpVSvoiQSi7Ek6qdw2qAeQKwbjwV4d1DfPLb7JWbBaNiv
P0BxSVXuTLBv7LPga5tLmymaK6t5YJFJBSRCpBHXg1DX3lf6RpuqhBqFhbXQT7vnxB9v0zXm3iD4
BeFtYv3u7OW40wuOYoMGPPqAeleK0V1viXwHfaJMjWm++tpSQpjQl0PoQPbuP04zyVaJp7HHOEoO
0j5Wors/Hfw01vwHPGb3y7izmJEV1BkqT6EHoa4yiiiimSFFFFFeheAfDl/DqbXs0fljycKpHPJB
59On61j+BPDsWu6yDdjNpERkf327L/j+Fe7f2cLCAqkAU45wMZrXDTh7ZRY5QfJzBXtPwM8A6u/i
ay8UXdp5WlxRO8MjnmRiCowPxJyfSuP+FvgP/hPPE5trh3i0+1Tzbl1HLDOAgPYn+QNfXmnafa6T
p1vp9lEIbW3QRxRgk7VH1qpFGJEwo4PJO7BBqwLRGVGk+ZsYB3GlihRY0IiCuevFXFUAYx+dduJn
Zcpyyk1sWqKKKpfZ0VsopU98d6rOmbrlN3yHI/EVoSSOr/LCSv8AeJwKzLzUbKzuA11c+UNh+6hP
ce1cKUW3cuNSv0bCiiika2idlk2srIeOTxSNa20h3SoOOcbsA0kOq6Ze5W21GNpOyyDbmpZFPCug
3Hseh+hrN04S2NoYqvSd5aoKKKKr77SSNkj4fkKM859q8+8Z+B1W2fUdMjzIgL3MYJZnJPUDPGOe
K9F6bX8pQQeMmkvY/PtpYThCw6gdawlF02elRqxxVN3RFcWtvdwmK5gjmjPVJFDD8jXzf8W/g6uh
RTeIfDyE6cCXubXOTBk/eX/Z56dvp0+laZNDFcQvDNGskTqVdHGQwPUEV840V0fjTQG0LWiFyYbg
eYhPr/EPz5/Gitk7q5584uEnFnwHRXU/EPwz/wAIl431DS0B8hX8yAnHMbcjp+X4UV6H8KbTytLg
uI7VJS5Z23yY+YMVBHHoBXq8tw723l3FkrBh9zzOn6V5L8M3lutCt4AuYIy6MUPzA7i3Pp1FelOi
6fZbXmZs9CzEkn8a5pR97Vfn/md9GHPFPoe5/s8aDDZ+DrjWt264v52Q+iohwB9c5Nex15J+z3q0
F34Ak05eJ7K6cOM9Q/zA/wAx+Fet1l392bOVfMgKpn5D5nB/SqsPie3Fy24RKAMAl/8A61WHhOrQ
lTNmDJBXPJrDvfB8P2bdbA+d1ILcEelelzVKkFzJHNUwnvc0Aooop03iS8vdTEFvLGsZPGwZJ/HF
T3Fi10+Jo95ZDy0hPcV0fhXw1pNrp0LtbpLKwBeRxyD6D0qfxDDpmmPFMpEQdSDlsAciuOnioSm4
2/r7zBxaCiiq97fWmm2j3V9cxW1un3pZXCqPqTXj82l3KXzQpEy/N8vBrtNAi1JLBkvox5an5PMb
Bq9Jf2zvsjCM4XcGyOlXre3ZYxPII2Y9Af51tzLt+YKLloixRWLY+L/DmpXy2VjrlhcXLDcIop1Z
iPbBraqk8MzrtCqCefv9P0pJ7SWTZI7bAvQ7+v6VppFJO5AAIz/DVXVpbbTIhLe3iKg5IbgDFRVb
kr8unz/zOrB01SvGUtWFFFZuta/pfh3T5b7Vb2K2gjUsS7cn2A6k/SvJ/inJvu9PQkEosgODn+7R
XM+K9ZOueILi5wvloTFFtOcoCcH8c5/GilHYxru9Rs+Zv2gYynxMZtynfZxHAOSOo5/KiuJ8Z+I5
PFfi3UdZcFVnlPlKf4Yxwo/ICirXhbxrqHhVZo7dVlhlO4oTghumQcHt/IV6/wCG7q91OC1vNYjM
d26bjEc/KO3HbPX2zivnuvVvAXjBLkRWN9Ltu4k2xs3/AC0UD19R3/P1xtRjHnu9zbD1HflbJ/B3
jvW/A95PcaRLHiddssUqbkb0OPUZr6c+EniPxF4o8JNqHiGFFYzFbeVU2GVMD5sfXIz3r4+r3z4J
/FS3tLaPwtr1yIkU4srmQ4UD/nmx7ex/D0r1WO1SOTzE2At17Zqjr2rjSrXfFbNNKRgYX5V+ppLb
V7dwCJFLMcdeallKzRyb04btXY9jrautD6DorO0/XtJ1WWSOw1K1uZInKOsUoYgjrx+NaNaPhC6m
u9CiuLlRukYtx6dq5T4xrbP4RuHcHzlCCPn/AKaJn9M12dmY7OwiijUKir0HavDvil40ttbuY9N0
ydJ7WMZkmXOC2TlRkcjgHIyDnrXz1C8q115nmy2CvJv2hkVvhzExkKlb6PC5+9w1eqXNzDZ2stzc
OI4YkLu56KoGSa+Vfi38Uo/HUtvYaZFLFpVsxfMoAaWTpnHYAdPqa85imlglWWGR45F6OjEEfiK6
+z+J/iO2jjilnjuIl4YOgDMPTI6flXG0V6SbWxmeZRSyQSrLFI0ciHKuhwQfUGu3sPjD4608Rquu
yzLH/DcIr5+pIyfzrhaK7rUvijq93amGzQWbN1kV9xH04GP1rkbvVtSv023moXdwvpNMzj9TVOin
KTluC0PRda+NvjTWtNFkb2OzBPzy2amOR/bdnj8MVwl9qd/qciyX99c3bqMBriVpCB6ZJqrRRRRR
UgFFFFFKrMjq6MVZTkEHBBoooAKKKK1tE8RXmiah9qTE+SS6SsxBJIJPX73HU5r1Ox8f6Zqdzawo
0m52zIhQgoP5H8KKKpTlyuPc0jVlFWRp+H9dvPDeu2mr2DKLi2feoblW9QfYivqHwT8afD/i27i0
6ZJNO1GReEmI8tz6K3r7ECiio/HvxEhg06TTNMaX7VOm1pB8vlL3OeufTH1+vjNFFY06UaatEhu5
xvxt+KUfk3PhHR2Vy4AvLlW+7zny1x34GT74r59oooooorQQUUUUUUUUAFFFFFFFFABRRRX/2Q==" posX="80" posY="70" rotation="1">
<authors>
<author fullName="Mubeena Mirza" email="" website="" />
</authors>
<executableAuthors>
<author fullName="Zhuowen Tu" email="" website="" />
<author fullName="Bruce Liu" email="" website="" />
</executableAuthors>
<metadata>
<data key="__creationDateKey" value="Tue Sep 11 10:28:28 PDT 2007" />
</metadata>
<input id="BrainParser_1.Structure" name="Structure" description="0: segmentation sub-cortical structures
1: sulci detection" required="false" enabled="true" order="0" prefix="-p" prefixSpaced="true" prefixAllArgs="false">
<format type="Enumerated" cardinality="1">
<enumeration>0</enumeration>
<enumeration>1</enumeration>
<enumeration>2</enumeration>
</format>
<values>
<value>2</value>
</values>
</input>
<input id="BrainParser_1.Testing" name="Testing" description="0: perform segmentation/detection
1: perform training
" required="false" enabled="true" order="1" prefix="-r" prefixSpaced="true" prefixAllArgs="false">
<format type="Enumerated" cardinality="1">
<enumeration>0</enumeration>
<enumeration>1</enumeration>
</format>
<values>
<value>0</value>
</values>
</input>
<input id="BrainParser_1.SourceFile" name="Source File" description="In testing, it points to the source file in training, it points directory in which the training volumes are saved.
" required="true" enabled="true" order="2">
<format type="File" cardinality="1">
<fileTypes>
<filetype name="Analyze Image" extension="img" description="Analyze Image">
<need>hdr</need>
</filetype>
<filetype name="Analyze Image" extension="img" description="Analyze Image file">
<need>hdr</need>
</filetype>
</fileTypes>
</format>
</input>
<output id="BrainParser_1.TargetFile" name="Target File" description="In testing, it points to the target file in training, it points directory in which the trained classifiers are saved.
" required="true" enabled="true" order="3">
<format type="File" cardinality="1">
<fileTypes>
<filetype name="Analyze Image" extension="img" description="Analyze Image">
<need>hdr</need>
</filetype>
</fileTypes>
</format>
</output>
<input id="BrainParser_1.ModelsDirectory" name="Models Directory" description="Directory of trained models." required="false" enabled="true" order="4" prefix="-m" prefixSpaced="true" prefixAllArgs="false">
<format type="Directory" cardinality="1" />
<values>
<value>pipeline://cranium.loni.ucla.edu//usr/local/loniWorkflows/BrainParser/56_Structure</value>
</values>
</input>
<input id="BrainParser_1.NumberofStructures" name="Number of Structures" description="Only effective in training." required="false" enabled="false" order="5" prefix="-n" prefixSpaced="true" prefixAllArgs="false">
<format type="Number" cardinality="1" />
<values>
<value>1</value>
</values>
</input>
<input id="BrainParser_1.NumberofIterations" name="Number of Iterations" required="false" enabled="false" order="6" prefix="-t" prefixSpaced="true" prefixAllArgs="false">
<format type="Number" cardinality="1" />
</input>
<input id="BrainParser_1.SmoothnessFactor" name="Smoothness Factor" description="Defalut=0.5, typical 0.0~2.0." required="true" enabled="true" order="7" prefix="-s" prefixSpaced="true" prefixAllArgs="false">
<format type="Number" cardinality="1" />
<values>
<value>2.0</value>
</values>
</input>
</module>
The Python code I’ve written is shown below:
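from bs4 import BeautifulSoup

if __name__ == '__main__':
    # The second argument selects the parser backend.
    with open('test.xml') as f:
        soup = BeautifulSoup(f, 'lxml')
    # Print every <input> element found under the "Brain Parser" module.
    for e in soup.find_all('module', attrs={'name': 'Brain Parser'}):
        for i in e.find_all('input'):
            print(i.prettify())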
And this is the result:
<input description="0: segmentation sub-cortical structures 1: sulci detection" enabled="true" id="BrainParser_1.Structure" name="Structure" order="0" prefix="-p" prefixallargs="false" prefixspaced="true" required="false"/>
<input description="0: perform segmentation/detection 1: perform training" enabled="true" id="BrainParser_1.Testing" name="Testing" order="1" prefix="-r" prefixallargs="false" prefixspaced="true" required="false"/>
<input description="In testing, it points to the source file in training, it points directory in which the training volumes are saved. " enabled="true" id="BrainParser_1.SourceFile" name="Source File" order="2" required="true"/>
<input description="Directory of trained models." enabled="true" id="BrainParser_1.ModelsDirectory" name="Models Directory" order="4" prefix="-m" prefixallargs="false" prefixspaced="true" required="false"/>
<input description="Only effective in training." enabled="false" id="BrainParser_1.NumberofStructures" name="Number of Structures" order="5" prefix="-n" prefixallargs="false" prefixspaced="true" required="false"/>
<input enabled="false" id="BrainParser_1.NumberofIterations" name="Number of Iterations" order="6" prefix="-t" prefixallargs="false" prefixspaced="true" required="false"/>
<input description="Defalut=0.5, typical 0.0~2.0." enabled="true" id="BrainParser_1.SmoothnessFactor" name="Smoothness Factor" order="7" prefix="-s" prefixallargs="false" prefixspaced="true" required="true"/>
As you can see, it thinks that input has no child elements, but this is not the case. I did some poking around, and it seems that elements like value and format are parsed as children of the module element. Can anybody help with this?
You are calling BeautifulSoup with "lxml", which tells it to use the lxml parser and parse the input as HTML. (In HTML, input tags are self-closing and don't have children, so your string is not valid HTML. BeautifulSoup does its magic HTML fixing and decides that you meant the input tag to close itself immediately, which is why you are not seeing any children.)

You want to call it with "xml", which tells it that the input is an XML document.
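
As a minimal sketch of the difference, the snippet below parses the same markup with both parser names and lists the direct children of each input element. The SAMPLE string is a trimmed-down, hypothetical stand-in for test.xml (the icon attribute and most inputs are left out), and input_children is a helper name invented for this illustration. It assumes both bs4 and lxml are installed, since the "xml" parser is provided by lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for test.xml (icon attribute omitted).
SAMPLE = """<module id="BrainParser_1" name="Brain Parser">
  <input id="BrainParser_1.Structure" name="Structure" prefixSpaced="true">
    <format type="Enumerated" cardinality="1">
      <enumeration>0</enumeration>
      <enumeration>1</enumeration>
    </format>
    <values>
      <value>2</value>
    </values>
  </input>
</module>"""


def input_children(markup, parser):
    """Map each <input> id to the tag names of its direct children.

    `parser` is the BeautifulSoup parser name, e.g. 'lxml' (HTML) or 'xml'.
    """
    soup = BeautifulSoup(markup, parser)
    result = {}
    for module in soup.find_all('module', attrs={'name': 'Brain Parser'}):
        for inp in module.find_all('input'):
            # recursive=False restricts the search to direct child tags.
            children = inp.find_all(recursive=False)
            result[inp['id']] = [child.name for child in children]
    return result


if __name__ == '__main__':
    print('lxml (HTML):', input_children(SAMPLE, 'lxml'))
    print('xml        :', input_children(SAMPLE, 'xml'))
```

With "xml", each input should keep its format and values children. The XML parser also preserves attribute case, so prefixSpaced and prefixAllArgs are no longer lowercased the way they are in the HTML output shown in the question.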