Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7129231
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 28, 20262026-05-28T11:13:16+00:00 2026-05-28T11:13:16+00:00

We use zero-width regex strings to specify the places in a string of amino

  • 0

We use zero-width regex strings to specify the places in a string of amino acid symbols (basically A-Z) that are valid cleavage sites. For example, the proteolytic enzyme trypsin cleaves after K or R except when followed by P ((?<=[KR])(?!P)). I want to convert these regexes to the “cut/no-cut” notation also common in this field. For example, trypsin cuts after “KR” with a no-cut of “P”. My first attempt at this works for simple cases:

// match zero or one regex term like (?<=[KR]) or (?<=K) or (?<![KR]) or (?<!K)
// followed by zero or one term like (?=[KR]) or (?=K) or (?![KR]) or (?!K)
boost::regex cutNoCutRegex("(?:\\(+\\?<([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?(?:\\(+\\?([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?");

Without the C++ escaping, that’s:

(?:\(+\?<([=!])(\[[A-Z]+\]|[A-Z])\)+)?(?:\(+\?([=!])(\[[A-Z]+\]|[A-Z])\)+)?

I’d like to change this to support somewhat more complicated regexes, such as multiple characters, non-capturing groups, character sets, ranges in character sets, negated sets, and start/end of string:
(?<=K|R) or (?<=(?:K)|(?:R)) or (?<=[^A-JL-QS-Z]) or (?<=^M|[KR])

These extra features would seem to explode the complexity of the regex. I’m pretty sure I’ll need to enable the “experimental” BOOST_REGEX_MATCH_EXTRA feature of Boost.Regex. Is there a better way to do what I’m doing? Am I missing some other regex possibilities in zero-width regexes?

Here is pseudo-code for my unit tests for the existing code covering many of the simple cases. The “sense” member is “C” when the “cut” field corresponds to the look-behind, and “N” when the “cut” field corresponds to the look-ahead. The current pepXMLSpecificity() function can invert the character set if it would produce a shorter list.

struct PepXMLSpecificity { std::string cut, no_cut, sense; };
void unit_assert_equal(string expected, string actual);

"(?<=[QWERTY])(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.no_cut);

"(?<![QWERTY])(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("QWERTY", result.no_cut);

"(?<=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?<![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

"(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

// the following tests aren't supported yet

"(?<=^M)|(?<=[KR])"
unit_assert_equal("N", result.sense);
unit_assert_equal("KR", result.cut); // the 'M' part is dropped
unit_assert_equal("", result.no_cut);

"(?<=K|R)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=(?:K)|(?:R))"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=[^A-JL-QS-Z])(?!P)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("P", result.no_cut);
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-28T11:13:16+00:00Added an answer on May 28, 2026 at 11:13 am

    As I understand it the situation is, you have a library of existing regexes which, when applied to common string representations of aminio-acid sequences, identify probable cut-points for a proteolytic enzyme.

    You want to automatically produce a standard textual description of the cutpoints implied by the regex.

    Observations:

    • You don’t need to be able to parse arbitrary regexes, you only need to be able to parse the cases that you actually have in your library.

    • You don’t necessarily need to parse all of them. Particularly difficult ones could be kicked out and done by hand provided there weren’t too many of them.

    Really I think you need to do the following.

    1. pepXMLSpecificity needs to return one or more descriptions, i.e. a vector<struct PepXMLSpecificity>, since a regex can be authored to combine arbitrary regexes cf. jpalacek’s comment.

    2. You should tackle the actual contents of your library of regexes starting with the common cases and working down, and just add special cases for each common type of regex until you have got them all (or at least enough of them to satisfy your boss).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a collection of strings that need to be parsed (using RegEx patterns)
We use a data acquisition card to take readings from a device that increases
I am required to use custom fonts in my application. Problem: For ListViews that
I'm trying to figure out a way to create a zero-width, zero-height image which
I am a GWT person with zero jquery experience. Sorry about that. Unfortunately, I
My goal is to match the first 0 and everything after that zero in
'''use Jython''' import shutil print dir(shutil) There is no, shutil.move, how does one move
Use case: A does something on his box and gots stuck. He asks B
Use case: 3rd party application wants to programatically monitor a text file being generated
Use case: user clicks the link on a webpage - boom! load of files

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.