Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 1030887
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 16, 20262026-05-16T13:50:31+00:00 2026-05-16T13:50:31+00:00

I have markup language which is similar to markdown and the one used by

  • 0

I have markup language which is similar to markdown and the one used by SO.

Legacy parser was based on regexes and was complete nightmare to maintain, so I’ve come up with my own solution based on EBNF grammar and implemented via mxTextTools/SimpleParse.

However, there are issues with some tokens which may include each other, and I don’t see a ‘right’ way to do it.

Here is part of my grammar:

newline          := "\r\n"/"\n"/"\r"
indent           := ("\r\n"/"\n"/"\r"), [ \t]
number           := [0-9]+
whitespace       := [ \t]+
symbol_mark      := [*_>#`%]
symbol_mark_noa  := [_>#`%]
symbol_mark_nou  := [*>#`%]
symbol_mark_nop  := [*_>#`]
punctuation      := [\(\)\,\.\!\?]
noaccent_code    := -(newline / '`')+
accent_code      := -(newline / '``')+
symbol           := -(whitespace / newline)
text             := -newline+
safe_text        := -(newline / whitespace / [*_>#`] / '%%' / punctuation)+/whitespace
link             := 'http' / 'ftp', 's'?, '://', (-[ \t\r\n<>`^'"*\,\.\!\?]/([,\.\?],?-[ \t\r\n<>`^'"*]))+
strikedout       := -[ \t\r\n*_>#`^]+
ctrlw            := '^W'+
ctrlh            := '^H'+
strikeout        := (strikedout, (whitespace, strikedout)*, ctrlw) / (strikedout, ctrlh)
strong           := ('**', (inline_nostrong/symbol), (inline_safe_nostrong/symbol_mark_noa)* , '**') / ('__' , (inline_nostrong/symbol), (inline_safe_nostrong/symbol_mark_nou)*, '__')
emphasis              := ('*',?-'*', (inline_noast/symbol), (inline_safe_noast/symbol_mark_noa)*, '*') / ('_',?-'_', (inline_nound/symbol), (inline_safe_nound/symbol_mark_nou)*, '_')
inline_code           := ('`' , noaccent_code , '`') / ('``' , accent_code , '``')
inline_spoiler        := ('%%', (inline_nospoiler/symbol), (inline_safe_nop/symbol_mark_nop)*, '%%')
inline                := (inline_code / inline_spoiler / strikeout / strong / emphasis / link)
inline_nostrong       := (?-('**'/'__'),(inline_code / reference / signature / inline_spoiler / strikeout / emphasis / link))
inline_nospoiler       := (?-'%%',(inline_code / emphasis / strikeout / emphasis / link))
inline_noast          := (?-'*',(inline_code / inline_spoiler / strikeout / strong / link))
inline_nound          := (?-'_',(inline_code / inline_spoiler / strikeout / strong / link))
inline_safe           := (inline_code / inline_spoiler / strikeout / strong / emphasis / link / safe_text / punctuation)+
inline_safe_nostrong  := (?-('**'/'__'),(inline_code / inline_spoiler / strikeout / emphasis / link / safe_text / punctuation))+
inline_safe_noast     := (?-'*',(inline_code / inline_spoiler / strikeout / strong / link / safe_text / punctuation))+
inline_safe_nound     := (?-'_',(inline_code / inline_spoiler / strikeout / strong / link / safe_text / punctuation))+
inline_safe_nop        := (?-'%%',(inline_code / emphasis / strikeout / strong / link / safe_text / punctuation))+
inline_full           := (inline_code / inline_spoiler / strikeout / strong / emphasis / link / safe_text / punctuation / symbol_mark / text)+
line                  := newline, ?-[ \t], inline_full?
sub_cite              := whitespace?, ?-reference, '>'
cite                  := newline, whitespace?, '>', sub_cite*, inline_full?
code                  := newline, [ \t], [ \t], [ \t], [ \t], text
block_cite            := cite+
block_code            := code+
all                   := (block_cite / block_code / line / code)+

First problem is, spoiler, strong and emphasis can include each other in arbitrary order. And its possible that later I’ll need more such inline markups.

My current solution involves just creating separate token for each combination (inline_noast, inline_nostrong, etc), but obviously, number of such combinations grows too fast with growing number of markup elements.

Second problem is that these lookaheads in strong/emphasis behave VERY poorly on some cases of bad markup like __._.__*__.__...___._.____.__**___*** (lots of randomly placed markup symbols). It takes minutes to parse few kb of such random text.

Is it something wrong with my grammar or I should use some other kind of parser for this task?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-16T13:50:32+00:00Added an answer on May 16, 2026 at 1:50 pm

    If one thing includes another, then normally you treat them as separate tokens and then nest them in the grammar. Lepl (http://www.acooke.org/lepl which I wrote) and PyParsing (which is probably the most popular pure-Python parser) both allow you to nest things recursively.

    So in Lepl you could write code something like:

    # these are tokens (defined as regexps)
    stg_marker = Token(r'\*\*')
    emp_marker = Token(r'\*') # tokens are longest match, so strong is preferred if possible
    spo_marker = Token(r'%%')
    ....
    # grammar rules combine tokens
    contents = Delayed() # this will be defined later and lets us recurse
    strong = stg_marker + contents + stg_marker
    emphasis = emp_marker + contents + emp_marker
    spoiler = spo_marker + contents + spo_marker
    other_stuff = .....
    contents += strong | emphasis | spoiler | other_stuff # this defines contents recursively
    

    Then you can see, I hope, how contents will match nested use of strong, emphasis, etc.

    There’s much more than this to do for your final solution, and efficiency could be an issue in any pure-Python parser (There are some parsers that are implemented in C but callable from Python. These will be faster, but may be trickier to use; I can’t recommend any because I haven’t used them).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm modifying PHP Markdown (a PHP parser of the markup language which is used
In my own markup language, I have a quotation tags >>which use these characters
I have some markup similar to the following: <select> <option selected=selected>Apple</option> <option selected=>Orange</option> </select>
i have a markup which look like this: <h3>Paragraf3-dummytext</h3> <p> <a name=paragraf3> Quisque id
I have a basic page to which I'm adding an uploader control based on
Issue: I have a markup like this (only the important lines): <%@ Control Language=C#
Is there a commonly used markup language for graphs (the topological kind). I would
I have some markup for a set of fields within a form: <ul> <li>
I have this markup: <ul><li id=link-notifications> <ul class=list-notifications style=display:none> <li><a href=/link/to>Item 1</a></li> </ul> </li>
I have this markup: <div id=slider1> <div class=ui-slider ui-slider-horizontal ui-widget ui-widget-content ui-corner-all> <a style=left:

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.