The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: May 26, 2026

Good Early Morning, I have the following python regex file that we established on


Good Early Morning,

I have the following Python regex script that we put together in a previous post. It is meant to extract anything that looks like 'chr' + number + ':' + big number + '..' + big number (for example, chr1:100000..120000).
If chr1 is replaced by chrX, the script no longer works.

Here is the original script :

    # Opens each file to read/modify
    infile='myfile.txt'
    outfile='outfile.txt'

    #import Regex
    import re

    with open (infile, mode='r', buffering=-1) as in_f, open (outfile, mode='w', buffering=-1) as out_f:
        f = (i for i in in_f if '\t' in i.rstrip())
        for line in f:
            _, k = line.split('\t',1)
            x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
            if not x:
                continue
            out_f.write(' '.join(x[0]) + '\n')

If I change this line to:

    x = re.findall(r'^1..100\t([+-])chrX(\d+):(\d+)\.\.(\d+).+$',k)

I still cannot extract anything that looks like chrX.
Also, you should know that some lines may be empty!

Help Please 🙂 Thanks

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 1:45 am

    I don’t fully understand your question, but I will attempt to give some advice based on your code.

    Here is the most important line:

    x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
    

    Observations:

    0) buffering=-1 is already the default for open(), so passing it explicitly changes nothing. I recommend you remove it. Under the default policy a regular text file is buffered in fixed-size chunks (line buffering, buffering=1, is only the default for interactive streams), and iterating over the file still gives you one line at a time, which is what you want here.

    1) re.findall() returns a list of matches (a list of tuples when the pattern has several groups). However, by using $ in your pattern you have guaranteed that you will get at most one match, because each line can only have one end-of-line. So you should probably use re.search(). You could even use re.match(), which only matches at the start of the string, just as your ^ anchor does.

    2) I don’t recommend using the .split() method to strip off everything up to the first tab. Just fold that part into your regular expression, or search for a pattern that does not care what comes before it. It’s simpler.

    3) Your pattern requires that each line start with a string like this:

    1aa100
    100100
    1xx100
    1xy100
    

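    Is this what you wanted? Does each line start with a number that always ends in “100”? If it’s always a number you might want to use \d instead of . in the pattern. If you meant a literal “1..100”, escape the periods as \.\. the way you already do later in the pattern.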

    4) You require a tab after the number-like thing matched above. Then you have a capture group, which matches either a '+' or a '-' and lets you collect the matched value. I’m curious what you will do with it.

    5) The pattern chr\d+ will match chr0, chr1, chr11, chr111, etc. Any combination of digits, with a minimum length of 1 digit. I’m not sure if you expect it to actually match a capital ‘X’ (you talked about matching chrX) but it definitely won’t.

    6) You match a number, two actual periods, and another number. This looks perfectly correct and good to me. Then, after the second number, you use a . and a + together. This requires at least one extra character before the end of the line. I am wondering if this is causing your problem: when nothing follows the second number, the regex backtracks and .+ consumes the number’s last digit, so the captured value is silently truncated. Perhaps you should use .* which matches zero or more extra characters?

    7) If you use re.match() instead of re.findall(), you won’t need to use x[0] to get to the match group.

    8) If you have a match object m, ' '.join(m) does not work. You get a TypeError, because a match object is not iterable. You need to use ' '.join(m.groups()) instead.

    9) I think the pattern with chr and two numbers separated by .. is pretty good by itself, so maybe you can relax the rest of the pattern and just match on those.

    10) I always like to pre-compile my regular expression patterns. It can be slightly faster (the re module caches recently used patterns, so the gain is usually small), and then you can use the methods on the compiled pattern. For example, if pat is a pre-compiled regular expression, you can use pat.search(line) to search a line of text.
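To make observations 1, 6, 7 and 8 concrete, here is a small sketch you can run. The sample string is hypothetical: I am assuming that what remains after your .split() looks like this and that nothing follows the second number. Your real file may differ.

```python
import re

# Hypothetical sample: the part of a line after the first tab, with nothing
# following the second number. The real file layout is an assumption.
sample = "1..100\t+chr1:100000..120000\n"

strict = r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$'   # original pattern
relaxed = r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).*$'  # .+ replaced by .*

# findall() with several groups returns a list of tuples (hence x[0]).
found_strict = re.findall(strict, sample)
found_relaxed = re.findall(relaxed, sample)
print(found_strict)   # last number loses a digit to the trailing .+
print(found_relaxed)  # last number is captured whole

# match() returns a single match object (or None) instead of a list.
m = re.match(relaxed, sample)
joined = ' '.join(m.groups())
print(joined)

# A match object is not iterable, so joining it directly fails.
try:
    ' '.join(m)
    join_error = None
except TypeError as exc:
    join_error = exc
print(type(join_error).__name__)
```

With this sample, the original pattern still matches, but only by giving up the final digit of the second number, which is easy to miss when reading the output file.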

    Putting my suggestions together, here is some Python code for you to try out:

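    import re

    infile = 'myfile.txt'
    outfile = 'outfile.txt'

    # Sign, chromosome name (anything up to the colon), start and end positions.
    pat = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')

    with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
        for line in in_f:
            # Skip empty lines and lines without a tab-separated field.
            if '\t' not in line.rstrip():
                continue
            m = pat.search(line)
            if not m:
                continue
            # Write the four captured groups separated by spaces.
            out_f.write(' '.join(m.groups()) + '\n')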
    

    EDIT: Since you do seem to want to recognize the string chrX as valid, I changed the above example code. Instead of \d+ to match digits, it now uses [^:]+ to match one or more characters other than a colon. The above code should match chr1:, chrX:, or pretty much anything else between chr and the colon.
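Here is a short sketch of how the different patterns behave on a few lines. The sample lines are made up for illustration, and the stricter alternative assumes the only non-numeric names you care about are X, Y and M; adjust that to your data. It also shows why chrX(\d+) found nothing: that pattern demands at least one digit after the X.

```python
import re

# Pattern from the example above: anything but a colon after "chr".
loose_pat = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')
# Stricter alternative (assumption: names are digits, or one of X, Y, M).
strict_pat = re.compile(r'([+-])chr(\d+|[XYM]):(\d+)\.\.(\d+)')
# The changed pattern from the question: requires digits after the X.
asker_pat = re.compile(r'([+-])chrX(\d+):(\d+)\.\.(\d+)')

# Hypothetical input lines; the real file layout is an assumption.
lines = [
    "id1\t1..100\t+chr1:100000..120000\n",
    "id2\t1..100\t-chrX:5000..7000\textra\n",
    "\n",                                   # empty line, skipped
    "id3\t1..100\t+chr1 oops:1..2\n",       # malformed chromosome name
]

def extract(pattern, text_lines):
    """Return the space-joined groups for every line the pattern matches."""
    results = []
    for line in text_lines:
        m = pattern.search(line)
        if m:
            results.append(' '.join(m.groups()))
    return results

loose_results = extract(loose_pat, lines)
strict_results = extract(strict_pat, lines)
asker_results = extract(asker_pat, lines)

print(loose_results)   # includes the malformed "1 oops" line
print(strict_results)  # only chr1 and chrX
print(asker_results)   # empty: "chrX:" has no digit after the X
```

The [^:]+ version is the most forgiving, which is convenient but also lets malformed names through; the stricter alternation is worth considering if you know the full set of names in advance.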

© 2021 The Archive Base. All Rights Reserved