I will try to keep this short and to the point.
Given the following:
#!/usr/bin/python
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
para = etree.SubElement(sect,'para')
para.text = 'this is a [b]long[/b] block of text. Much longer than this example makes it out to be.'
How would I best go about converting the output to what I have below? Notice that the [b]...[/b] markers became a <b> element.
<root>
<sect>
<para>
this is a <b>long</b> block of text.
Much longer than this example makes it out to be.
</para>
</sect>
</root>
My real input and XML are considerably more complex, but this is the gist of it. I have taken a text document with a standard format and am converting it to XML. The structure of the document is fairly static, so this is not as crazy as it sounds. I currently have it broken into lines. This is relevant because, as I go through each line, I have no trouble identifying a <sect> or a <title>, but a <para> will often have some extra formatting in its line (in this example, a [b]) that needs to be converted as well. What would be the best way of accomplishing this?
Items to keep in mind:
- The authors of my input texts are not always consistent. Therefore, it would be best to develop a loose regexp that finds [b] WORD [/b] and also copes with author errors such as [b[WORD[/b]. My current idea is to match something like [b or b].
- I am currently processing my input file line by line, and I have removed any blank lines. Should I consider doing this conversion in a separate pass afterwards? I have no strong preference, but I feel that it can be contained in a single loop through the text.
- This will need to play well with lxml when I output my document. For example, see the edit below with my comment on the BBCode parser.
I have worked on this most of the afternoon, and can discuss more of the routes I have taken. I will be working on this throughout the evening so if I come across other items to keep in mind I will update this question accordingly.
EDIT: My problem with the BBCode parser
Paul thoughtfully suggested postmarkup-1.1.4. However, as you can see, it does not play well with lxml, which converts the elements to entities. I ran into the same problem this afternoon when I tried this with a search and replace. Ultimately, as was pointed out, this is a perfect job for sed. However, I was hoping not to be the end user of this script myself, and would rather have everything contained within one command.
>>> p.text = render_bbcode(p.text)
>>> p.text
'this is a <strong>long</strong> text string'
>>> etree.tostring(root)
'<root><p>this is a &lt;strong&gt;long&lt;/strong&gt; text string</p></root>'
Doing this in reverse returns equally poor results:
>>> p.text
'this is a [b]long[/b] text string'
>>> render_bbcode(etree.tostring(root))
u'<root><p>this is a <strong>long</strong> string</p></root>'
The postmarkup library seems to come closest to what you want to do.
http://pypi.python.org/pypi/postmarkup/1.1.4
Unfortunately it hasn’t seen a lot of development recently, but I don’t see any other libraries that look tons better.
Starting from there and modifying the existing elements to fit your syntax is probably faster than reinventing the parsing wheel from scratch.
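If you stay with a renderer like this, the entity problem shown in the question's edit comes from assigning markup to .text, which lxml treats as character data and escapes. A minimal sketch of one way around it is to parse the rendered string as an XML fragment instead. This assumes the renderer's output is well-formed XML; the `rendered` literal and the `<p>` wrapper are copied from the question's session, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library module is close enough for this sketch.
    import xml.etree.ElementTree as etree

# Assumed renderer output for the paragraph text (taken from the question).
rendered = 'this is a <strong>long</strong> text string'

root = etree.Element('root')

# Wrap the rendered text in the target tag and parse it, so <strong>
# becomes a real child element instead of escaped character data.
# fromstring() raises a parse error if the markup is not well-formed.
p = etree.fromstring('<p>%s</p>' % rendered)
root.append(p)

result = etree.tostring(root, encoding='unicode')
print(result)
```

The trade-off is that any stray < or & in the source text will make the parse fail, so inconsistent input may need cleaning first.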
If that isn’t a good direction, you might look at lower-level lexing and parsing, but that will rapidly get complex, to the point that you might be better off with simple repetitive regexes and hand correction. How big is your corpus?
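As a rough illustration of the simple-regex route, here is a sketch that builds the <b> children directly through lxml's text and tail attributes, so nothing gets escaped. The pattern and the helper name `set_marked_text` are hypothetical, and the pattern only tolerates the one slip mentioned in the question (a "[" typed where a "]" belongs); real input will probably need more cases.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library module is close enough for this sketch.
    import xml.etree.ElementTree as etree

# Loose pattern: "[b" or "[/b" followed by either "]" or a mistyped "[".
# Whitespace just inside the markers is dropped, so "[b] WORD [/b]" yields "WORD".
BOLD = re.compile(r'\[b[\]\[]\s*(.*?)\s*\[/b[\]\[]', re.IGNORECASE)

def set_marked_text(elem, raw):
    """Fill elem from raw, turning each loosely matched [b]...[/b] into a <b> child."""
    pos = 0
    last = None  # most recent <b>; text that follows it goes into its .tail
    for match in BOLD.finditer(raw):
        before = raw[pos:match.start()]
        if last is None:
            elem.text = before
        else:
            last.tail = before
        last = etree.SubElement(elem, 'b')
        last.text = match.group(1)
        pos = match.end()
    rest = raw[pos:]
    if last is None:
        elem.text = rest
    else:
        last.tail = rest

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
para = etree.SubElement(sect, 'para')
set_marked_text(para, 'this is a [b]long[/b] block of text.')

result = etree.tostring(root, encoding='unicode')
print(result)
```

Because it works on one string at a time, a helper like this could be called from inside an existing line-by-line loop whenever a <para> is created.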
The final item of note is that tasks like this are precisely what sed was written to do. It can be amazingly powerful if you’re willing to learn how to use it. If you’re not already comfortable with it though, the Python might be easier.