I need to parse the Newick format, which is used to describe trees. It looks like a series of brackets, commas and letters that denote nodes:
(A,B,(C,D)E)F
or, as another example:
(,(((,(,)),),))
A (,) element denotes nodes with the same parent. For my purpose (measuring the path length between two leaves) I need to find such nested elements successively.
So, my question is: how do I match different symbols the same number of times?
For example, I want to match the A...B pattern (some number of A's followed by the same number of B's) in the string:
CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB
The regex should return: ['AABB','AB','AAABBB','AB','AB','AABB']
The number of repetitions is different every time, so A{n}B{n} doesn’t work.
Thanks.
Your problem is a classic example of what regular expressions can’t do.
The “Use of lemma” section of http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages contains a proof that the language “a^nb^n” is not regular (so it can’t be recognized by regular expressions).
Using regular expressions, you can only write a pattern that handles n up to a given maximum, and a pattern for a large n can take a long time to evaluate.
PS. Your problem can be solved using formal grammars (http://en.wikipedia.org/wiki/Formal_grammar) or a counter automaton (http://en.wikipedia.org/wiki/Counter_automaton).
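To make both options concrete, here is a rough Python sketch (Python is only a guess based on the list syntax in the question). The names bounded_pattern, balanced_runs and max_n are made up for illustration. I am also assuming, from the expected output in the question, that an unbalanced run such as AAABB should be trimmed to its balanced core AABB. The first function is the "given maximum n" regex; the second scans the string with two counters, which is roughly what a counter automaton does. It assumes the two symbols are different characters.

```python
import re

SAMPLE = "CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB"


def bounded_pattern(max_n):
    """Regex for A^k B^k with 1 <= k <= max_n (hypothetical helper)."""
    # Put the longest alternative first, so that at a given start position
    # a larger balanced run is tried before a smaller one.
    alternatives = ["A{%d}B{%d}" % (k, k) for k in range(max_n, 0, -1)]
    return re.compile("|".join(alternatives))


def balanced_runs(text, first="A", second="B"):
    """Collect first^n second^n substrings (n >= 1) using two counters."""
    matches = []
    i = 0
    while i < len(text):
        # Counter 1: length of the run of `first` starting at position i.
        a_count = 0
        while i < len(text) and text[i] == first:
            a_count += 1
            i += 1
        if a_count == 0:
            i += 1  # not the start of a run, skip this character
            continue
        # Counter 2: length of the run of `second` that follows immediately.
        b_count = 0
        while i < len(text) and text[i] == second:
            b_count += 1
            i += 1
        # Assumption: keep only the balanced core of the two runs.
        n = min(a_count, b_count)
        if n:
            matches.append(first * n + second * n)
    return matches


if __name__ == "__main__":
    print(bounded_pattern(3).findall(SAMPLE))
    print(balanced_runs(SAMPLE))
```

For the sample string both calls give the list from the question, but the regex version only stays correct as long as no run needs an n larger than max_n, while the counter version has no such limit.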