Background
I have written a very simple BBCode parser in C# that transforms BBCode into HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid, regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode into HTML.
Question
- Does a standard “BBCode to HTML” specification exist?
- How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
- How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for such a case, but I wonder whether this approach is “too smart”.
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy and complex for my needs and, what is worse, they use tons of regular expressions and do not produce the markup I expect. Ideally, I want XHTML as output. To infer the “BBCode to HTML” transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct in my opinion. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the order of the closing tags). Firebug, of course, shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such cases of wrongly ordered closing tags, but I have doubts:
- Should I rely on this browser behavior and not try to produce XHTML?
- Maybe "[b][i]zzz[/b]ccc[/i]" should be understood as "<b>[i]zzz</b>ccc[/i]". That looks logical for such improper formatting, but it conflicts with the BBCode output of popular forums (zzz in bold italic followed by ccc in italic, not a bold "[i]zzz" followed by a literal "ccc[/i]").
Thanks.
On your first question, I don’t think that relying on browsers to correct any kind of mistake is a good idea, regardless of the scope of your project (well, maybe except when you’re actually bug-testing the browser itself). Some browsers might do an awesome job of it while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send correct syntax to the browser in the first place.
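One way to confirm that generated markup is well-formed before it reaches a browser is to run it through a strict XML parser. The following is a minimal sketch in Python (the check ports directly to any language with an XML parser, C# included). The helper name `is_well_formed_fragment` and the wrapping `<root>` element are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment: str) -> bool:
    """Return True if `fragment` parses as well-formed XML.

    The fragment is wrapped in a hypothetical <root> element because an
    XML document needs exactly one top-level element.
    """
    try:
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError:
        # Misnested or unclosed tags end up here.
        return False
    return True


print(is_well_formed_fragment("<b><i>zzz</b></i>"))  # misnested closing tags
print(is_well_formed_fragment("<b><i>zzz</i></b>"))  # properly nested
```

Note that a plain XML parser only knows the five predefined XML entities, so HTML entities such as `&nbsp;` would be rejected by this check unless they are written as numeric character references.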
Regarding your second question, since you’re trying to convert correct BBCode into correct HTML, if your input is "[b][i]zzz[/b]ccc[/i]", its correct HTML equivalent would be "<i><b>zzz</b>ccc</i>" and not "<b>[i]zzz</b>ccc[/i]". This is where things get complicated, because you would no longer be writing just a converter but also a syntax checker and corrector. I wrote a similar script in PHP for a rather weird game engine scripting language, and the logic could easily be applied to your case. Basically, I set a flag for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but it did the trick for what I needed. If you need more advanced search patterns, I think you’re stuck with regex.
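The idea of tracking open tags and repairing misplaced closing tags can be sketched as follows. This is not the PHP script mentioned above, just a minimal illustration in Python (chosen for brevity; the same logic maps onto a C# `List<string>` and `Dictionary<string, int>`). The function name `bbcode_to_xhtml` is hypothetical, and several behaviors are assumptions rather than any standard: a nested duplicate such as "[b][b]" is collapsed into one element, a stray closing tag is kept as literal text, and tags still open at the end of input are closed automatically.

```python
import html
import re

# Matches only the three supported tags, opening or closing.
TOKEN = re.compile(r"\[(/?)([biu])\]", re.IGNORECASE)


def bbcode_to_xhtml(source: str) -> str:
    out = []       # output fragments
    active = []    # tags logically open, in opening order (no duplicates)
    emitted = []   # tags actually open in the output; always a prefix of `active`
    depth = {}     # nesting count per tag name

    def write_text(text: str) -> None:
        if not text:
            return
        # Open tags lazily, so no empty elements are produced.
        for tag in active[len(emitted):]:
            out.append(f"<{tag}>")
            emitted.append(tag)
        out.append(html.escape(text))

    pos = 0
    for m in TOKEN.finditer(source):
        write_text(source[pos:m.start()])
        pos = m.end()
        closing, tag = m.group(1) == "/", m.group(2).lower()
        if not closing:
            depth[tag] = depth.get(tag, 0) + 1
            if depth[tag] == 1:
                active.append(tag)
        elif depth.get(tag, 0) == 0:
            write_text(m.group(0))  # stray closing tag: keep it as literal text
        else:
            depth[tag] -= 1
            if depth[tag] == 0:
                active.remove(tag)
                if tag in emitted:
                    # Close everything opened after `tag`, then `tag` itself.
                    # Tags that are still active are reopened by write_text.
                    k = emitted.index(tag)
                    for open_tag in reversed(emitted[k:]):
                        out.append(f"</{open_tag}>")
                    del emitted[k:]
    write_text(source[pos:])
    for open_tag in reversed(emitted):  # close whatever is left open
        out.append(f"</{open_tag}>")
    return "".join(out)


print(bbcode_to_xhtml("[b][i]zzz[/b]ccc[/i]"))      # <b><i>zzz</i></b><i>ccc</i>
print(bbcode_to_xhtml("[b][i][u]zzz[/b][/i][/u]"))  # <b><i><u>zzz</u></i></b>
```

For "[b][i]zzz[/b]ccc[/i]" this sketch yields "<b><i>zzz</i></b><i>ccc</i>", which renders the same as "<i><b>zzz</b>ccc</i>" while needing only a single left-to-right pass. Whether collapsing "[b][b]" or keeping stray closing tags as text is the right policy is a judgment call, since no formal BBCode specification settles it.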