I am parsing several XML document feeds with BeautifulSoup, and would like to do some preprocessing to replace non-standard CDATA tags with custom XML tags. To illustrate:
The following XML source…
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
…would turn into:
<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubDate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubDate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>
I don’t think this question has been asked before on SO (I tried a few different SO queries). I’ve also tried a few different approaches using .findAll('cdata', text=True) and then applying the BeautifulSoup replaceWith() method to each resulting NavigableString. The attempts I’ve made have resulted in either no substitution, or what looks like a recursive loop.
I’m happy to post my previous attempts, but given that the problem here is quite simple I’m hoping someone can post a clear example of how to accomplish the search-and-replace above using BeautifulSoup 3.
CData is a subclass of NavigableString, so you can find all CData elements by first searching for all NavigableString objects, and then testing whether each is an instance of CData. Once you’ve got one, it’s easily replaced using replaceWith, as you suggested:
A couple of notes:
You can call a BeautifulSoup object as though it were a function, and the effect is the same as calling its .findAll() method.
The only way I know to get the content of a CData object in BS3 is to slice it, as in the snippet above. str(navstr) would keep all the <![CDATA[...]]> junk, which obviously you don’t want. In BS4, str(navstr) gives you the content without the junk.
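If the substitution only needs to happen on the raw feed text before it reaches BeautifulSoup, a plain regular-expression pass may also be enough. Note that this is a different technique from the replaceWith approach described in this answer, using only the standard library re module. It is a minimal sketch that assumes CDATA sections are never nested and that their content can be copied through unescaped; the helper name cdata_to_tag and the tag name myTag are illustrative, not part of any library.

```python
import re

# Matches one CDATA section. DOTALL lets the content span several lines,
# and the non-greedy group stops at the first "]]>".
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_tag(xml_text, tag="myTag"):
    """Wrap the content of every CDATA section in <tag>...</tag>.

    Hypothetical helper: the content is copied through unchanged, so any
    markup characters inside the CDATA section are not escaped.
    """
    def wrap(match):
        # A function replacement avoids backslash handling in the content.
        return "<%s>%s</%s>" % (tag, match.group(1), tag)

    return CDATA_RE.sub(wrap, xml_text)


feed = (
    "<title>The end of the world as we know it</title>\n"
    "<category><![CDATA[Planking Dancing]]></category>\n"
    "<dc:creator><![CDATA[Bart Simpson]]></dc:creator>\n"
)
converted = cdata_to_tag(feed)
print(converted)
```

The converted string can then be handed to the parser as usual. Because the pass works on text rather than on the parse tree, it would also rewrite anything that merely looks like a CDATA section inside a comment, so it is best suited to feeds whose structure you already trust.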