Problem*
Given some data (text) which has style applied to it with a loosely defined markup, such as:
The [blower]cat[elower] [weight 15]sat[normal] on the mat.[newline]
Which would ideally be represented as something like:
The <text class="lower">cat</text> <strong>sat</strong> on the mat.<br />
The markup has the following properties:
- A tag represents an instruction to format text in a given way from that point onward
- An end tag may exist, but only for a small set of tags. Other tags are linear (see point 1)
- Each tag has its own behaviour, and may affect previously applied tags in different ways
- Some nesting is implied from the linear tags adding to or overwriting existing styles
- Metadata may be outside of tags (e.g. [beg][xyz]cmd[end1] is all tag related, no content)
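As a rough illustration of the first step, the markup above could be split into text and tag tokens before any rules are applied. This is only a sketch: the token shapes, the `TAG_RE` pattern and the `tokenize` name are assumptions made for the example, and it does not try to recognise metadata that sits between tags (such as the `cmd` in `[beg][xyz]cmd[end1]`), which would need a later pass.

```python
import re

# Assumed tag syntax: [name] or [name argument], e.g. [blower] or [weight 15].
TAG_RE = re.compile(r"\[([A-Za-z]\w*)(?:\s+([^\]]*))?\]")

def tokenize(source):
    """Split source into ("text", s) and ("tag", name, arg) tokens, in order."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(source):
        if match.start() > pos:
            # Plain text between the previous tag and this one.
            tokens.append(("text", source[pos:match.start()]))
        # arg is None when the tag has no argument.
        tokens.append(("tag", match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens

for token in tokenize("The [blower]cat[elower] [weight 15]sat[normal] on the mat.[newline]"):
    print(token)
```

Keeping the tokens flat at this stage leaves the question of how tags interact entirely to the rules that run afterwards.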
Requirements
- Define rules around tag interaction (e.g. a style tag such as [bold] is closed by another style tag such as [normal] or [light])
- Nesting of some content (tags which do not overwrite as above will nest and break accordingly)
- Define maps from the well-defined in-memory representation to some output format
Thoughts
- Parse into a DOM-like structure: attempt to group tags into ‘sets’. When a tag is encountered, close the active tag for that set and open the new one. This produces <tag>content</tag>. Problems around proper nesting and closing/reopening tags, so that you don’t end up with overlap situations like <b>text<i>text</b>text</i>, are annoying but straightforward enough.
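One way to picture the ‘sets’ idea is to keep a single active value per set and emit flat runs of text, each carrying a snapshot of the active styles. Because every run is wrapped on its own, overlapping output cannot occur. The sketch below is an assumption-laden illustration rather than a finished design: the `RULES` table, the set names, the `to_runs` and `render` functions, and the decision to treat any `[weight N]` as bold are all invented for the example.

```python
from html import escape

# Hypothetical rule table: tag name -> (set name, value); None clears the set.
RULES = {
    "blower": ("case", "lower"),
    "elower": ("case", None),
    "weight": ("weight", "bold"),  # assumption: any [weight N] means bold
    "normal": ("weight", None),
}

def to_runs(tokens):
    """Turn ("text", s) / ("tag", name, arg) tokens into styled runs."""
    active = {}  # set name -> currently active value
    runs = []
    for token in tokens:
        if token[0] == "text":
            runs.append(("text", token[1], dict(active)))  # snapshot the styles
        elif token[1] == "newline":
            runs.append(("break",))
        elif token[1] in RULES:
            set_name, value = RULES[token[1]]
            if value is None:
                active.pop(set_name, None)   # close whatever is open in the set
            else:
                active[set_name] = value     # replace the set's active tag
        # Unknown tags are silently dropped in this sketch.
    return runs

def render(runs):
    """Map runs to HTML; each run is wrapped independently, so nothing overlaps."""
    out = []
    for run in runs:
        if run[0] == "break":
            out.append("<br />")
            continue
        _, text, style = run
        html = escape(text)
        if style.get("weight") == "bold":
            html = "<strong>" + html + "</strong>"
        if "case" in style:
            html = '<text class="%s">%s</text>' % (style["case"], html)
        out.append(html)
    return "".join(out)
```

The trade-off is that adjacent runs sharing a style are wrapped separately; merging them back into a nested tree would be a further step on top of this representation.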
How would you set about designing a data structure or method of parsing the content such that a set of rules can aid transformation to a well defined structure?
Alternatively, any suggestions for fields/areas that you would look at when solving this sort of problem?
*Real-world problem
This problem is isomorphic (at least as you’ve described it so far) to XML. You have syntax that introduces and ends markup, and it comes mostly in pairs [blower]…[elower] and [weight 15]…[normal] with the occasional standalone [newline].
So if you know how to build an XML parser with tags, you know how to do this, too.
If you don’t, you just need a grammar (in EBNF) and a parser generator.
This requires a pretty simple lexer and a pretty simple parser (see FLEX and YACC as examples).
You can build your DOM as a set of tree nodes as the parser runs by attaching actions to the grammar rules (see the YACC documentation). Many other parser generators will let you build the tree as you parse, too.
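To make the tree-building idea concrete without a parser generator, here is a minimal hand-written recursive-descent sketch in Python that plays the role of the generated parser and its actions. The grammar in the comments, the `PAIRS` and `STANDALONE` tables, and the `Node` class are all assumptions for illustration. It expects tokens of the form `("text", s)` or `("tag", name, arg)`, and it only accepts properly paired tags.

```python
# Hypothetical grammar (EBNF), with one parsing branch per alternative:
#   document = { item } ;
#   item     = TEXT | STANDALONE | element ;
#   element  = OPEN , { item } , CLOSE ;
PAIRS = {"blower": "elower", "weight": "normal"}  # assumed open -> close names
STANDALONE = {"newline"}

class Node:
    def __init__(self, name, arg=None):
        self.name = name
        self.arg = arg
        self.children = []  # Node instances or plain strings

def parse(tokens):
    """Parse a token list into a tree rooted at a 'document' node."""
    root = Node("document")
    pos = parse_items(tokens, 0, root)
    if pos != len(tokens):
        raise ValueError("unexpected tag [%s]" % tokens[pos][1])
    return root

def parse_items(tokens, pos, parent):
    """Parse { item } into parent; return the index of the first unconsumed token."""
    while pos < len(tokens):
        token = tokens[pos]
        if token[0] == "text":
            parent.children.append(token[1])
            pos += 1
        elif token[1] in STANDALONE:
            parent.children.append(Node(token[1]))
            pos += 1
        elif token[1] in PAIRS:
            # element: the "action" here is creating and attaching a tree node.
            node = Node(token[1], token[2])
            pos = parse_items(tokens, pos + 1, node)
            closer = PAIRS[token[1]]
            if pos >= len(tokens) or tokens[pos][1] != closer:
                raise ValueError("missing [%s]" % closer)
            parent.children.append(node)
            pos += 1  # consume the closing tag
        else:
            return pos  # a closing or unknown tag ends this item list
    return pos
```

With these assumed pairs, an input such as `[blower]a[weight 15]b[elower]` raises `ValueError`, so markup that relies on tags overriding one another would need either extra rules or a normalising pass before a strict grammar like this one applies.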