I am parsing several XML document feeds with BeautifulSoup, and would like to do

Question

Asked: June 14, 2026 (2026-06-14T17:03:15+00:00)

I am parsing several XML document feeds with BeautifulSoup, and would like to do some preprocessing to replace non-standard CDATA tags with custom XML tags. To illustrate:

The following XML source…

<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>

…would turn into:

<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubDate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubDate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>

I don’t think this question has been asked before on SO (I tried a few different SO queries). I’ve also tried a few different approaches using .findAll('cdata', text=True) and then applying the BeautifulSoup replaceWith() method to each resulting NavigableString. The attempts I’ve made have resulted in either no substitution or what looks like a recursive loop.

I’m happy to post my previous attempts, but given that the problem here is quite simple I’m hoping someone can post a clear example of how to accomplish the search-and-replace above using BeautifulSoup 3.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T17:03:17+00:00

CData is a subclass of NavigableString, so you can find all CData
elements by first searching for all NavigableString objects, and then testing
whether each is an instance of CData. Once you’ve got one, it’s easily
replaced using replaceWith, as you suggested:

>>> from BeautifulSoup import BeautifulSoup, CData, Tag
>>> source = """
... <title>The end of the world as we know it</title>
... <category><![CDATA[Planking Dancing]]></category>
... <pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
... <dc:creator><![CDATA[Bart Simpson]]></dc:creator>
... """
>>> soup = BeautifulSoup(source)
>>> for navstr in soup(text=True):
...     if isinstance(navstr, CData):
...         tag = Tag(soup, "myTag")
...         tag.insert(0, navstr[:])
...         navstr.replaceWith(tag)
... 
>>> soup

<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubdate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubdate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>

>>>

A couple of notes:

You can call a BeautifulSoup object as though it were a function, and the
effect is the same as calling its .findAll() method.

The only way I know to get the content of a CData object in BS3 is to slice
it, as in the snippet above. str(navstr) would keep all the
<![CDATA[...]]> junk, which obviously you don’t want. In BS4, str(navstr)
gives you the content without the junk.
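
If all you really need is the preprocessing step on the raw feed text, before anything is parsed, a plain regular-expression substitution may be enough. This is a different technique from the BeautifulSoup approach above, not a replacement for it, and it is only a minimal sketch: the helper name cdata_to_tag and its tag_name parameter are made up for illustration, and it assumes the feed contains well-formed, non-nested CDATA sections. Because CDATA content may contain characters such as < or &, the sketch escapes the content before wrapping it.

```python
import re
from xml.sax.saxutils import escape

# Matches one CDATA section and captures its content (non-greedy, may span lines).
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_tag(source, tag_name="myTag"):
    """Replace each CDATA section in `source` with <tag_name>...</tag_name>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    def repl(match):
        # Escape &, < and > so former CDATA content cannot turn into markup.
        content = escape(match.group(1))
        return "<{0}>{1}</{0}>".format(tag_name, content)

    return CDATA_RE.sub(repl, source)


source = """
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
"""
print(cdata_to_tag(source))
```

The result can then be handed to whichever parser you are using. Unlike the BeautifulSoup version, this never touches tag names, so pubDate would keep its original casing.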
