I need to create an XML document from a piece of plain text and the begin and end offsets of each XML element that should be inserted. Here are a few test cases I’d like it to pass:
val text = "The dog chased the cat."
val spans = Seq(
(0, 23, <xml/>),
(4, 22, <phrase/>),
(4, 7, <token/>))
val expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
assert(expected === spansToXML(text, spans))
val text = "aabbccdd"
val spans = Seq(
(0, 8, <xml x="1"/>),
(0, 4, <ab y="foo"/>),
(4, 8, <cd z="42>3"/>))
val expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
assert(expected === spansToXML(text, spans))
val spans = Seq(
(0, 1, <a/>),
(0, 0, <b/>),
(0, 0, <c/>),
(1, 1, <d/>),
(1, 1, <e/>))
assert(<a><b/><c/> <d/><e/></a> === spansToXML(" ", spans))
My partial solution (see my answer below) works by string concatenation and XML.loadString. That seems hacky, and I’m also not 100% sure this solution works correctly in all the corner cases…
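For illustration, a minimal sketch of the string-concatenation idea might look like the following. It is not the partial solution itself, and it makes several simplifying assumptions: spans carry a plain tag name instead of an XML element (so attributes are ignored), spans are properly nested, and the names Span, escape and spansToString are hypothetical. Empty elements come out as <b></b> rather than <b/>, which parses to the same XML.

```scala
object SpanSketch {
  // Hypothetical simplification: a span is (begin, end, tagName).
  type Span = (Int, Int, String)

  // Escape characters that may not appear verbatim in XML text.
  def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  def spansToString(text: String, spans: Seq[Span]): String = {
    // Outer spans first: ascending begin, then descending end (sortBy is stable).
    val sorted = spans.sortBy { case (b, e, _) => (b, -e) }
    val out = new StringBuilder
    var pos = 0                  // how much of `text` has been emitted so far
    var open = List.empty[Span]  // stack of currently open elements

    def emitTextUpTo(limit: Int): Unit = {
      // Assumption: spans nest properly; crossing spans fail loudly here.
      require(limit >= pos, "spans must be properly nested")
      out.append(escape(text.substring(pos, limit)))
      pos = limit
    }

    def closeTop(): Unit = {
      val (_, e, name) = open.head
      emitTextUpTo(e)
      out.append(s"</$name>")
      open = open.tail
    }

    // An open span `o` must be closed before `s` starts if it ends earlier,
    // or if it ends exactly where `s` begins and `s` is non-empty or `o` is empty.
    // This keeps a zero-length span inside an enclosing span that ends there.
    def mustClose(o: Span, s: Span): Boolean =
      o._2 < s._1 || (o._2 == s._1 && (s._2 > s._1 || o._1 == o._2))

    sorted.foreach { s =>
      while (open.nonEmpty && mustClose(open.head, s)) closeTop()
      emitTextUpTo(s._1)
      out.append(s"<${s._3}>")
      open = s :: open
    }
    while (open.nonEmpty) closeTop()
    emitTextUpTo(text.length)
    out.toString
  }
}
```

The resulting string could then be handed to XML.loadString, which is exactly the step that feels hacky: the tree structure is only recovered by re-parsing text that was just glued together.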
Any better solutions? (For what it’s worth, I’d be happy to switch to anti-xml if that would make this task easier.)
Updated 10 Aug 2011 to add more test cases and provide a cleaner specification.
Given the bounty you put forward, I studied your problem for some time and came up with the following solution, which passes all your test cases.
I would really like to get my answer accepted, so please tell me if there’s anything wrong with my solution.
Some comments:
I left the commented-out print statement in, in case you want to see what’s going on during execution.
Beyond your specification, I preserve any existing children of the elements passed in; a comment marks where this is done.
I do not build the XML nodes manually; I modify the ones passed in. To avoid splitting the opening and closing tags, I had to change the algorithm quite a lot, but the idea of sorting spans by begin and -end comes from your solution.
The code is somewhat advanced Scala, especially where I build the different Orderings I need. I did simplify it somewhat from the first version I got.
I avoided creating a tree representing the intervals by using a SortedMap and filtering the intervals after extraction. This choice is somewhat suboptimal: I have heard that there are “better” data structures for representing nested intervals, such as interval trees (studied in computational geometry), but they are quite complex to implement, and I don’t think they are needed here.
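The ordering and the SortedMap-plus-filter idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of this answer: intervals are keyed by their (begin, end) pair, labels are plain strings, and the names IntervalSketch, outerFirst, index and inside are hypothetical.

```scala
import scala.collection.immutable.SortedMap

object IntervalSketch {
  type Interval = (Int, Int)

  // Ascending begin, then descending end ("begin and -end"):
  // an enclosing interval sorts before everything it contains.
  val outerFirst: Ordering[Interval] =
    Ordering.by[Interval, (Int, Int)](iv => (iv._1, -iv._2))

  // Index the spans by interval. Caveat of this sketch: two spans with
  // identical offsets collide as keys, so only the last one survives.
  def index(spans: Seq[(Int, Int, String)]): SortedMap[Interval, String] =
    SortedMap(spans.map { case (b, e, n) => ((b, e), n) }: _*)(outerFirst)

  // Linear filter for the intervals contained in `outer` (excluding `outer`
  // itself). An interval tree would answer this faster, at the cost of a
  // more complex implementation.
  def inside(m: SortedMap[Interval, String], outer: Interval): SortedMap[Interval, String] =
    m.filter { case ((b, e), _) => (b, e) != outer && outer._1 <= b && e <= outer._2 }
}
```

Because the map iterates in outer-first order, walking it visits each element before its descendants, which is why no explicit interval tree is needed for inputs of this size.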