Sorry for the confusing title but I’m not sure how better to explain it.
I am building a simple web server for a school project that has to parse a custom scripting language. I have a line that looks like this:
<p>Here's the date: <% pr date() %></p><p>Here's the date again: <% pr date() %></p>
I’m using the following regular expression to try and pull out the <% … %> stuff…
<% *(.*) *%>
The problem is it is matching from the first open tag to the last closing tag, rather than from the first open tag to the first closing tag. So the resulting match is this:
<% pr date() %></p><p>Here's the date again: <% pr date() %>
…instead of:
<% pr date() %>
I thought I could solve it by using something like this:
<% *([^(<%)]*) *%>
…but it doesn’t seem to work. Any help is appreciated, thanks.
You need a non-greedy match, which stops at the first position where the rest of the pattern can match:
<% *(.*?) *%>
The non-greedy quantifier can of course be applied to most other patterns.
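To see the difference on the line from the question, here is a minimal sketch in Python (the question does not say which language the server is written in, so Python's `re` module is an assumption; the `.*?` syntax is the same in most regex flavors):

```python
import re

# The sample line from the question.
line = ("<p>Here's the date: <% pr date() %></p>"
        "<p>Here's the date again: <% pr date() %></p>")

# Greedy: .* grabs as much as possible, so the match runs to the LAST %>.
greedy = re.findall(r"<% *(.*) *%>", line)

# Non-greedy: .*? grabs as little as possible, so each match ends at the
# first %> that follows its opening <%.
lazy = re.findall(r"<% *(.*?) *%>", line)

print(len(greedy))  # one long match spanning both tags
print(lazy)         # one entry per tag
```

The greedy pattern yields a single match covering both tags, while the non-greedy one yields one match per tag.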
However, I would advise against using regexes here. The meta-pattern
OPEN-TOKEN CONTENT CLOSE-TOKEN
is simple enough for a hand-written parser/scanner. It will then also be easier for you to recognize when your tags are within comments (and possibly other cases where you don’t want a match).
Hand-written code of this kind may not appeal to you, but it is worth considering.
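As a rough illustration of the scanner idea, here is a minimal Python sketch. The function name `find_tags` is hypothetical, and treating `<!-- ... -->` as the comment syntax is an assumption, since the question does not say how comments look in the custom language:

```python
def find_tags(text, open_tok="<%", close_tok="%>",
              comment_open="<!--", comment_close="-->"):
    """Yield the content of each open_tok ... close_tok span outside comments."""
    pos = 0
    while pos < len(text):
        if text.startswith(comment_open, pos):
            # Skip the whole comment, including any tags inside it.
            end = text.find(comment_close, pos + len(comment_open))
            if end == -1:
                return  # unterminated comment: stop scanning
            pos = end + len(comment_close)
        elif text.startswith(open_tok, pos):
            # The first close token after the open token ends the tag.
            end = text.find(close_tok, pos + len(open_tok))
            if end == -1:
                return  # unterminated tag: stop scanning
            yield text[pos + len(open_tok):end].strip()
            pos = end + len(close_tok)
        else:
            pos += 1  # ordinary character, move on


sample = ("<p><% pr date() %></p>"
          "<!-- <% hidden() %> -->"
          "<p><% pr date() %></p>")
tags = list(find_tags(sample))
print(tags)
```

Because the scanner always looks for the first close token after an open token, the greedy-match problem cannot occur, and the tag inside the comment is skipped. A real implementation would probably report unterminated tags as errors instead of silently stopping.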
Footnote: Each time you (write a parser|fire a regular expression), you already have one foot in prison.