I’m trying to write a regex in Python to parse a Newick tree, but for the life of me I can’t get the last part of it to match. There are three types of Newick formats I need to parse:
((A,B),C);
((A:0.1,B:0.2),C:0.3);
((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);
…each of which contains three labels (A, B, C) and various other bits of information. I want to get the three labels. Here’s my regex:
regex = re.compile(r"""
(
([,(]) # boundary
([A-Z0-9_\-\.]+) # label
(:)? # optional colon
(\[.+?\])? # optional comment chunk
(\d+\.\d+)? # optional branchlengths
([),]) # end!
)
""", re.IGNORECASE + re.VERBOSE + re.DOTALL)
… however, I only get A and C. Not ever B. I’ve tracked the glitch down to the last captured group ([),]) – if I remove this, then I get all A, B, and C. Please help – what’s going wrong here?!
The problem is probably that you’re looking for non-overlapping instances of the regex.
Methods like findall won’t return B, because the match for A consumes the "," before B. Changing the end pattern to a lookahead, that is, replacing ([),]) with (?=[),]) so that it doesn’t consume anything, solves the problem.
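To illustrate the lookahead fix, here is a minimal sketch rather than a definitive implementation. It reuses the pattern from the question, but the named group "label", the non-capturing groups, and the helper name labels are my own choices for the example.

```python
import re

# Same pattern as in the question, except that the closing delimiter is
# checked with a lookahead, so it is not consumed and can act as the
# boundary of the next match. The group name "label" is illustrative.
LABEL_RE = re.compile(r"""
    [,(]                        # boundary
    (?P<label>[A-Z0-9_\-.]+)    # label
    :?                          # optional colon
    (?:\[.+?\])?                # optional comment chunk
    (?:\d+\.\d+)?               # optional branch length
    (?=[),])                    # end, as a lookahead
""", re.IGNORECASE | re.VERBOSE | re.DOTALL)


def labels(newick):
    """Return every label found in a Newick string, in order."""
    return [m.group("label") for m in LABEL_RE.finditer(newick)]


for tree in ("((A,B),C);",
             "((A:0.1,B:0.2),C:0.3);",
             "((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);"):
    print(labels(tree))
```

Using finditer with a named group avoids the tuples that findall returns when the pattern has several capturing groups.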
Otherwise, instead of using findall, you can use search iteratively and monkey with the pos argument. Something like this:
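A minimal sketch of such a loop, assuming the pattern from the question is kept as it is (the closing delimiter is still consumed). The function name labels_by_search and the choice to restart one character before the end of each match are assumptions made for illustration.

```python
import re

# The pattern from the question, minus the redundant outer group.
REGEX = re.compile(r"""
    ([,(])               # boundary
    ([A-Z0-9_\-.]+)      # label
    (:)?                 # optional colon
    (\[.+?\])?           # optional comment chunk
    (\d+\.\d+)?          # optional branch length
    ([),])               # end (consumed by the match)
""", re.IGNORECASE | re.VERBOSE | re.DOTALL)


def labels_by_search(newick):
    """Collect labels by calling search repeatedly with an explicit pos."""
    found = []
    pos = 0
    while True:
        m = REGEX.search(newick, pos)
        if m is None:
            break
        found.append(m.group(2))
        # Step back one character so that the delimiter which ended this
        # match can serve as the boundary of the next one.
        pos = m.end() - 1
    return found


print(labels_by_search("((A,B),C);"))
```

Each match is at least three characters long, so pos always moves forward and the loop terminates.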