I’m trying to write a regex in python to parse a Newick tree, but

Asked: May 26, 2026

I’m trying to write a regex in python to parse a Newick tree, but for the life of me I can’t get the last part of it to match. There are three types of Newick formats I need to parse:

((A,B),C);
((A:0.1,B:0.2),C:0.3);
((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);

…each of which contains three labels (A, B, C) and various other bits of information. I want to get the three labels. Here’s my regex:

regex = re.compile(r"""
(
    ([,(])              # boundary
    ([A-Z0-9_\-\.]+)    # label
    (:)?                # optional colon
    (\[.+?\])?          # optional comment chunk
    (\d+\.\d+)?         # optional branchlengths
    ([),])              # end!
)
""", re.IGNORECASE + re.VERBOSE + re.DOTALL)

… however, I only get A and C. Not ever B. I’ve tracked the glitch down to the last captured group ([),]) – if I remove this, then I get all A, B, and C. Please help – what’s going wrong here?!

1 Answer

Editorial Team · Answer 1 · 2026-05-26T15:55:48+00:00

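The problem is that findall only returns non-overlapping matches of the regex.
The match for A ends with the final group ([),]), which consumes the , before B. The next match can then only start after that comma, so B has no leading boundary character left to match and is skipped.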

>>> regex.findall("((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);")
[('(A:[c1]0.1,', '(', 'A', ':', '[c1]', '0.1', ','), (',C:[c2]0.3)', ',', 'C', ':', '[c2]', '0.3', ')')]
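
To make the consumption visible, here is a minimal sketch on a simplified string. The pattern and the names `consuming`, `text`, and `spans` are illustrative assumptions, not part of the original regex; the pattern keeps only the boundary, a one-character label, and the consuming end group.

```python
import re

# Simplified stand-in for the original regex: boundary, label, consuming end.
consuming = re.compile(r"[,(](\w)[),]")

text = "(A,B,C)"

# Record each match together with the (start, end) span it occupies.
spans = [(m.group(0), m.span()) for m in consuming.finditer(text)]
print(spans)  # [('(A,', (0, 3)), (',C)', (4, 7))]
```

The first match covers indices 0 to 3 and therefore includes the comma at index 2. Scanning resumes at index 3, which is B itself, so there is no boundary character in front of B any more and the next match starts at the comma before C.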

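Changing the end pattern to a lookahead, (?=[),]), solves the problem: it checks that the delimiter follows but does not consume it, so the same character can still serve as the boundary of the next match.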

>>> regex = re.compile(r"""
... (
...     ([,(])              # boundary
...     ([A-Z0-9_\-\.]+)    # label
...     (:)?                # optional colon
...     (\[.+?\])?          # optional comment chunk
...     (\d+\.\d+)?         # optional branchlengths
...     (?=[),])            # end!
... )
... """, re.IGNORECASE + re.VERBOSE + re.DOTALL)
>>>
>>> regex.findall("((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);")
[('(A:[c1]0.1', '(', 'A', ':', '[c1]', '0.1'), (',B:[c2]0.2', ',', 'B', ':', '[c2]', '0.2'), (',C:[c2]0.3', ',', 'C', ':', '[c2]', '0.3')]
>>>
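
If only the labels are needed, the lookahead version can be trimmed down. The following is a sketch under a few assumptions: the names `label_re` and `extract_labels` are made up for illustration, the optional parts are rewritten as non-capturing groups, and the label gets a named group so it can be read with finditer. It is a variation on the pattern above, not a full Newick parser.

```python
import re

# Variation on the lookahead pattern: only the label is captured.
label_re = re.compile(r"""
    [,(]                      # boundary: comma or opening parenthesis
    (?P<label>[A-Z0-9_.-]+)   # label
    :?                        # optional colon
    (?:\[.+?\])?              # optional comment chunk
    (?:\d+\.\d+)?             # optional branch length
    (?=[),])                  # lookahead: delimiter must follow, not consumed
""", re.IGNORECASE | re.VERBOSE | re.DOTALL)


def extract_labels(newick):
    """Return the labels found in a Newick string, in order of appearance."""
    return [m.group("label") for m in label_re.finditer(newick)]


for tree in ("((A,B),C);",
             "((A:0.1,B:0.2),C:0.3);",
             "((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);"):
    print(extract_labels(tree))  # ['A', 'B', 'C'] for each of the three
```

Flags are combined with | here, which is the conventional spelling; the + used above gives the same result for distinct flags.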

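Otherwise, you can keep your original regex (the one whose seventh group, ([),]), consumes the delimiter) and, instead of using findall, call search in a loop. Pass the pos argument so that each search starts at the delimiter that ended the previous match.

Something like this: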

>>> x = "((A:[c1]0.1,B:[c2]0.2),C:[c2]0.3);"
>>> r = []
>>> index = 0
>>> while True:
...     m = regex.search(x, index)
...     if not m:
...         break
...     r.append(m.groups())
...     index = m.end(7)-1
...
>>> r
[('(A:[c1]0.1,', '(', 'A', ':', '[c1]', '0.1', ','), (',B:[c2]0.2)', ',', 'B', ':', '[c2]', '0.2', ')'), (',C:[c2]0.3)', ',', 'C', ':', '[c2]', '0.3', ')')]
