I would like to implement a simple wiki-like markup parser as an exercise in using Scala parser combinators.
I would like to solve this bit by bit, so here is what I would like to achieve in the first version: a simple inline literal markup.
For example, if the input string is:
This is a sytax test ``code here`` . Hello ``World``
The output string should be:
This is a sytax test <code>code here</code> . Hello <code>World</code>
I tried to solve this using RegexParsers, and here is what I have so far:
import scala.util.parsing.combinator._
import scala.util.parsing.input._

object TestParser extends RegexParsers {
  override val skipWhitespace = false

  def toHTML(s: String) = "<code>" + s.drop(2).dropRight(2) + "</code>"

  val words = """(.)""".r
  val literal = """\B``(.)*``\B""".r ^^ toHTML
  val markup = (literal | words)*

  def run(s: String) = parseAll(markup, s) match {
    case Success(xs, next) => xs.mkString
    case _ => "fail"
  }
}

println(TestParser.run("This is a sytax test ``code here`` . Hello ``World``"))
With this code, a simpler input that contains only one literal markup works fine. For example:
This is a sytax test ``code here``.
becomes
This is a sytax test <code>code here</code>.
But when I run it with the example above, it yields
This is a sytax test <code>code here`` . Hello ``World</code>
I think this is because the regex I use:
"""\B``(.)*``\B""".r
allows any characters, including `` itself, inside a `` pair.
How should I restrict it so that a `` pair cannot contain another ``, and so fix this problem?
Here are some docs on non-greedy matching:
http://www.exampledepot.com/egs/java.util.regex/Greedy.html
Basically, it starts at the first `` and goes as far as it can while still getting a match, so the match ends at the `` after World.
By putting a ? after your *, you tell it to do the shortest match possible, instead of the longest match.
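To make the difference concrete, here is a minimal sketch that runs both patterns over the example input using only scala.util.matching.Regex, outside the parser. The object name LazyQuantifierDemo and the helper spans are made up for illustration; the greedy pattern is the one from the question, and the lazy one differs only by the added ?.

```scala
import scala.util.matching.Regex

object LazyQuantifierDemo {
  val input = "This is a sytax test ``code here`` . Hello ``World``"

  // Greedy: (.)* grabs as much as it can, so the match runs to the last ``.
  val greedy: Regex = """\B``(.)*``\B""".r

  // Lazy: (.)*? stops at the first closing `` that lets the match succeed.
  val lazyRe: Regex = """\B``(.)*?``\B""".r

  // Hypothetical helper: every span of the input matched by the given regex.
  def spans(re: Regex): List[String] = re.findAllIn(input).toList

  def main(args: Array[String]): Unit = {
    println(spans(greedy)) // one span covering both literals
    println(spans(lazyRe)) // two spans, one per literal
  }
}
```

In the parser from the question, the same change would presumably be a one-character edit to the literal regex, with everything else left as it is.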
Another option is to use [^`]* (anything EXCEPT `), and that will force it to stop earlier.
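The negated character class can be sketched the same way. This is only an illustration under a few assumptions: NegatedClassDemo and toHtml are hypothetical names, and the conversion is done with Regex.replaceAllIn rather than with the parser combinators from the question.

```scala
import scala.util.matching.Regex

object NegatedClassDemo {
  // [^`]* matches any run of characters other than a backtick,
  // so the match cannot run past the closing ``.
  val literal: Regex = """\B``([^`]*)``\B""".r

  // Hypothetical helper: rewrite every literal as a <code> element.
  // quoteReplacement keeps any $ or \ in the captured text literal.
  def toHtml(s: String): String =
    literal.replaceAllIn(
      s,
      m => Regex.quoteReplacement("<code>" + m.group(1) + "</code>")
    )

  def main(args: Array[String]): Unit =
    println(toHtml("This is a sytax test ``code here`` . Hello ``World``"))
}
```

One trade-off to keep in mind: this pattern rejects a literal that contains even a single backtick, whereas the non-greedy version would accept it.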