I’m trying to build a BBCode parser, but I’m having trouble keeping my regular expressions from matching too widely. For example, I want to implement a [list] to <ul> conversion like this:
\[list\](.*)\[/list\]
would be replaced by this:
<ul>$1</ul>
This works fine, except when I have two lists: the regular expression then matches the opening tag of the first list and the closing tag of the second. So this
[list]list1[/list] [list]list2[/list]
becomes this:
<ul>list1[/list] [list]list2</ul>
which produces really ugly output. Any idea on how to fix this?
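The over-wide match comes from .* being greedy: it consumes as much as it can, so the match runs to the last [/list] in the input. A minimal sketch of the usual quick fix, assuming Java's java.util.regex and a reluctant quantifier (.*?), is below. The class name, the convert helper, and the DOTALL flag (so that a list may span several lines) are illustrative assumptions, not part of the original pattern. Note that this only handles lists that are not nested inside each other.

```java
import java.util.regex.Pattern;

class ListTagDemo {
    // Greedy: .* grabs as much as possible, so the match ends at the LAST [/list].
    static final Pattern GREEDY =
            Pattern.compile("\\[list\\](.*)\\[/list\\]", Pattern.DOTALL);

    // Reluctant: .*? grabs as little as possible, so the match ends at the FIRST [/list].
    static final Pattern RELUCTANT =
            Pattern.compile("\\[list\\](.*?)\\[/list\\]", Pattern.DOTALL);

    // Hypothetical helper: replace every match of p with <ul>...</ul>.
    static String convert(Pattern p, String input) {
        return p.matcher(input).replaceAll("<ul>$1</ul>");
    }

    public static void main(String[] args) {
        String input = "[list]list1[/list] [list]list2[/list]";
        System.out.println(convert(GREEDY, input));    // <ul>list1[/list] [list]list2</ul>
        System.out.println(convert(RELUCTANT, input)); // <ul>list1</ul> <ul>list2</ul>
    }
}
```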
If what you are doing is not just a lightweight hack, but something more permanent, you probably want to move to a real parser. Regexps in Java are particularly slow (even with precompiled patterns), and matching nested constructs (especially different nested constructs like ‘foo [u][i] bar [s]baz[/s][/i][/u]’) is going to be a royal pain.
Instead, try using a state-based parser that repeatedly cuts your input into sections like ‘foo ’ / (u) / ‘[i] bar [s]baz[/s][/i][/u]’, and maintains a set of states that flip whenever you encounter the matching delimiter.
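One way to read that suggestion is a single left-to-right scan that keeps the currently open tags on a stack. The following is only a rough sketch under several assumptions: the class name, the toHtml method, and the tag table are all made up for illustration; tags carry no attributes; text is not HTML-escaped; and a closing tag that does not match the innermost open tag is simply left as literal text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class BbcodeSketch {
    // Hypothetical tag table: BBCode tag name -> HTML element name.
    static final Map<String, String> TAGS =
            Map.of("list", "ul", "u", "u", "i", "i", "s", "s");

    static String toHtml(String src) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // state: tags currently open
        int pos = 0;
        while (pos < src.length()) {
            int lb = src.indexOf('[', pos);
            if (lb < 0) { out.append(src, pos, src.length()); break; }
            out.append(src, pos, lb);            // plain text before the bracket
            int rb = src.indexOf(']', lb);
            if (rb < 0) { out.append(src, lb, src.length()); break; }

            String body = src.substring(lb + 1, rb);
            boolean closing = body.startsWith("/");
            String name = closing ? body.substring(1) : body;
            String html = TAGS.get(name);

            if (html == null) {
                // Not a known tag: emit '[' literally and rescan right after it.
                out.append('[');
                pos = lb + 1;
                continue;
            }
            if (!closing) {
                open.push(name);                 // flip state: tag is now open
                out.append('<').append(html).append('>');
            } else if (name.equals(open.peek())) {
                open.pop();                      // flip state: tag is closed again
                out.append("</").append(html).append('>');
            } else {
                out.append(src, lb, rb + 1);     // mismatched close: keep as text
            }
            pos = rb + 1;
        }
        // Close whatever is still open so the output stays balanced.
        while (!open.isEmpty()) {
            out.append("</").append(TAGS.get(open.pop())).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("[list]list1[/list] [list]list2[/list]"));
        System.out.println(toHtml("foo [u][i] bar [s]baz[/s][/i][/u]"));
    }
}
```

Because each closing tag is checked against the top of the stack, two lists in a row and differently nested tags are handled by the same loop, with no backtracking.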