It looks like this question has been asked for just about every language under the sun except C++. I have an XML document that has some bbcode stored within the text node. I am looking for the best way to remove it, and I thought I'd check here to see if anyone was aware of a pre-built library or an efficient method of accomplishing this myself. I was thinking of deleting anything that falls between a '[' and a ']' character. However, this gets messy with the XML documents that have been provided to me, because many of the instances of the BB are in the form '[[blahblahblah]]' and some are in the form '[blahblahblah].'
Here is the XML document. All data between the <text> tags gets read into a string. Any suggestions?
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
<page>
<title>Human Anatomy/Osteology/Axialskeleton</title>
<ns>0</ns>
<id>181313</id>
<revision>
<id>1481605</id>
<parentid>1379871</parentid>
<timestamp>2009-04-26T02:03:12Z</timestamp>
<contributor>
<username>Adrignola</username>
<id>169232</id>
</contributor>
<minor />
<comment>+Category</comment>
<sha1>hvxozde19haz4yhwj73ez82tf2bocbz</sha1>
<text xml:space="preserve"> [[Image:Axial_skeleton_diagram.svg|thumb|240px|right|Diagram of the axial skeleton]]
The Axial Skeleton is a division of the human skeleton and is named because it makes up the longitudinal ''axis'' of the body. It consists of the skull, hyoid bone, vertebral column, sternum and ribs. It is widely accepted to be made up of 80 bones, although this number varies from individual to individual.
[[Category:{{FULLBOOKNAME}}|{{FULLCHAPTERNAME}}]]</text>
</revision>
</page>
<page>
<title>Horn/General/Fingering Chart</title>
<ns>0</ns>
<id>23346</id>
<revision>
<id>1942387</id>
<parentid>1734837</parentid>
<timestamp>2010-10-02T20:21:09Z</timestamp>
<contributor>
<username>Nat682</username>
<id>144010</id>
</contributor>
<comment>added important note</comment>
<sha1>lana7m8m9r23oor0nh24ky45v71sai9</sha1>
<text xml:space="preserve">{{HornNavGeneral}}
The horn spans four plus octaves depending on the player and uses both the treble and bass clefs. In this chart it is assumed the player is using a double-horn with F and Bb sides. The number 1 indicates that the index-finger valve should be depressed, the number 2 indicates that the middle-finger valve should be depressed and the number 3 indicates that the ring-finger valve should be depressed. There are eight possible valve combinations among the first, second and third valves: 0, 1, 2, 3, 1-2, 1-3, 2-3, and 1-2-3. However, there are effectively seven combinations, because 1-2 will produce the same notes, perhaps slightly out of tune, as 3 alone. One depresses the thumb key to use the Bb side of the horn.
[[Image:Fingering chart.png]]
[[Category:Horn]]</text>
</revision>
</page>
</mediawiki>
So if you look at the bottom of each <page> element, you will see things like [[Category:{{FULLBOOKNAME}}|{{FULLCHAPTERNAME}}]], and that's what I'm looking to remove.
I'll assume the data is given to you in the form of an iterator you can read from. If you are getting it in the form of a std::string, getting an iterator you can read from is pretty easy.
In that case, what you want is a boost filter_iterator: http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/filter_iterator.html
The filter function you want is pretty easy. You keep a count that goes up by one for every '[' you see and down by one for every ']' you see (stopping at 0). While the count is positive, you filter out the character; the brackets themselves are filtered out too.
If you cannot use boost, but you are getting it from a std::string, well, that is a bit trickier. But only a bit. std::copy_if works.
If you are using C++11, a lambda makes this really easy. If not, you'll have to write your own functor that counts '['s, and since std::copy_if itself arrived with C++11, you would pair that functor with std::remove_copy_if and the inverted test.
As a concrete example of a simple case: you are being fed a std::string and want to produce a std::string without any []-delimited contents. A single pass with that counter does it, and it handles arbitrary depths of nested []s.
The filter_iterator helps in that you never have to have the entire string loaded into memory, which is useful if you don't know how malformed your input will be. Loading a few terabytes of data from disk into memory just to filter out [] is not needed, when you could stream the stuff and do the filtering on the fly. But your use case may not really care.