So, I’m digesting a protein sequence with an enzyme (for your curiosity, Asp-N) which

Question

Asked: May 22, 2026

So, I’m digesting a protein sequence with an enzyme (for your curiosity, Asp-N) which cleaves before the amino acid residues coded by B or D in a single-letter coded sequence. My actual analysis uses String#scan for the captures. I’m trying to figure out why the following regular expression doesn’t digest it correctly…

(\w*?)(?=[BD])|(.*\b)

where the second alternative (.*\b) exists to capture the end of the sequence.
For:

MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN

This should give something like: [MTM, DKPSQY, DKIEAELQ, DICN, DVLELL, DSKG, ... ] but instead misses each D in the sequence.

I’ve been using http://www.rubular.com for troubleshooting, which runs on 1.8.7 although I’ve also tested this REGEX on 1.9.2 to no avail. It is my understanding that zero-width lookahead assertions are supported in both versions of ruby. What am I doing wrong with my regex?


1 Answer

Editorial Team · Answer 1 · 2026-05-22T14:41:12+00:00

The simplest way to do this is to split on a zero-width lookahead:

s = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
p s.split(/(?=[BD])/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]
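As a minimal sketch, the same split can be wrapped in a small helper and run on the full sequence from the question. The method name `asp_n_digest` is made up for illustration and is not part of any library; the check at the end relies only on the fact that a zero-width split removes no characters.

```ruby
# Hypothetical helper (name chosen for this example only):
# split a single-letter protein sequence before every B or D.
def asp_n_digest(sequence)
  sequence.split(/(?=[BD])/)
end

full = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
peptides = asp_n_digest(full)
p peptides

# The lookahead consumes nothing, so joining the pieces
# should give back the original sequence.
p peptides.join == full
```

Because the cleavage points are zero-width, every residue ends up in exactly one fragment, which is a cheap sanity check for any digest.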

To understand what was going wrong with your solution, compare a simplified version of your regex with one that works:

p s.scan(/.*?(?=[BD]|$)/)
#=> ["MTM", "", "KPSQY", "", "KIEAELQ", "", "ICN", "", "VLELL", "", "SKG", ""]

p s.scan(/.+?(?=[BD]|$)/)
#=> ["MTM", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG"]

The problem is that when the pattern can match zero characters and still satisfy the zero-width lookahead, the match succeeds without advancing the scanning pointer. Let’s look at a simpler-but-similar test case:

s = "abcd"
p s.scan(//)      # Match any position, without advancing
#=> ["", "", "", "", ""]

p s.scan(/(?=.)/) # Anywhere that is followed by a character, without advancing
#=> ["", "", "", ""]
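To see where each of these empty matches sits, one option is to record the match offset inside a scan block. This is only an illustrative sketch: `match_offsets` is a made-up helper name, and "MTMDK" is a shortened stand-in for the protein sequence.

```ruby
# Hypothetical helper: collect [offset, matched text] for every
# match that String#scan reports.
def match_offsets(str, re)
  offsets = []
  str.scan(re) do
    m = Regexp.last_match
    offsets << [m.begin(0), m[0]]
  end
  offsets
end

p match_offsets("abcd", //)
#=> [[0, ""], [1, ""], [2, ""], [3, ""], [4, ""]]

p match_offsets("MTMDK", /.*?(?=[BD]|$)/)
#=> [[0, "MTM"], [3, ""], [4, "K"], [5, ""]]
```

In the second result the empty match sits at offset 3, directly in front of the D, and the next match starts at offset 4, so the D itself never appears in any match.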

A naive implementation of String#scan might get stuck in an infinite loop, repeatedly matching with the pointer before the first character. It appears that once a match occurs without advancing the pointer, the algorithm forcibly advances the pointer by one character. This explains the results in your case:

First it matches all the characters up to a B or D,
then it matches the zero-width position right before the B or D, without moving the character pointer,
as a result the algorithm moves the pointer past the B or D, and continues on after that.
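The steps above can be modelled with a short loop. This is a rough sketch of the observed behaviour, not Ruby's actual implementation; `scan_with_forced_advance` is a hypothetical name, and the only library call it depends on is `Regexp#match(str, pos)`.

```ruby
# A rough model of the scanning loop described above.
# It is an assumption about how scan behaves, checked against the real thing below.
def scan_with_forced_advance(str, re)
  results = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)   # search starting at character offset pos
    break unless m
    results << m[0]
    if m.end(0) == m.begin(0)
      # Empty match: step one character forward so the loop cannot stall.
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  results
end

s  = "MTMDKPSQYDKIEAELQDICNDVLELLDSKG"
re = /.*?(?=[BD]|$)/

# Compare the model against the real String#scan.
p scan_with_forced_advance(s, re) == s.scan(re)
```

The forced one-character step after an empty match is what skips each B or D, which is why requiring at least one character with `.+?` avoids the problem.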
