Good Early Morning,
I have the following Python regex script that we established in a previous post. It is meant to extract anything that looks like ‘chr’ + number + ‘:’ + bignumber + “..” + bignumber (for example, chr1:100000..120000).
If chr1 is replaced by chrX, the regex script doesn’t work anymore…
Here is the original script:
# Opens each file to read/modify
infile='myfile.txt'
outfile='outfile.txt'
#import Regex
import re
with open (infile, mode='r', buffering=-1) as in_f, open (outfile, mode='w', buffering=-1) as out_f:
    f = (i for i in in_f if '\t' in i.rstrip())
    for line in f:
        _, k = line.split('\t',1)
        x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
        if not x:
            continue
        out_f.write(' '.join(x[0]) + '\n')
If I change that line to:
x = re.findall(r'^1..100\t([+-])chrX(\d+):(\d+)\.\.(\d+).+$',k)
I still cannot extract the entries that look like chrX, etc…
Also, you should know that some lines may be empty!
Help please 🙂 Thanks
I don’t fully understand your question, but I will attempt to give some advice based on your code.
Here is the most important line:
x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
Observations:
0) buffering=-1 is already the default for open(), so passing it explicitly changes nothing. I recommend you get rid of it. Iterating over the file object still hands you one line at a time, which is what you want for this case.

1) re.findall() returns a list of matches. However, by using $ in your pattern you have guaranteed that you will get at most one match, because each line can only have one end-of-line. So you should probably use re.search(). You could even use re.match(), since you have a ^ to anchor to the start of the line.

2) I don’t recommend your use of the .split() method to strip off everything up to the first tab. Just fold a tab into your regular expression. It’s simpler and faster.

3) Your pattern requires that the text start with 1, then any two characters, then 100 (that is what ^1..100 matches). Is this what you wanted? Does each line start with a number that always ends in “100”? If it’s always a number, you might want to use \d instead of . in the pattern.

4) You require a tab after the number-like thing matched above. Then you have a match group, which matches either a ‘+’ or a ‘-’ and lets you collect the matched value. I’m curious what you will do with it.

5) The pattern chr\d+ will match chr0, chr1, chr11, chr111, etc.: any combination of digits, with a minimum length of 1 digit. I’m not sure if you expect it to actually match a capital ‘X’ (you talked about matching chrX), but it definitely won’t.

6) You match a number, two actual periods, and another number. This looks perfectly correct and good to me. Then, after the second number, you use a . and a + together. This requires one or more extra characters before the end of the line. I am wondering if this is causing your problem. Perhaps you should use .* which matches zero or more extra characters?

7) If you use re.match() instead of re.findall(), you won’t need to use x[0] to get to the match groups.

8) If you have a match object m, ' '.join(m) does not work. You get a TypeError. You need to use ' '.join(m.groups()) instead.

9) I think the pattern with chr and two numbers separated by .. is pretty good by itself, so maybe you can relax the rest of the pattern and just match on those.

10) I always like to pre-compile my regular expression patterns. It’s faster, and then you can use the methods on the compiled pattern. For example, if pat is a pre-compiled regular expression, you can use pat.search(line) to search a line of text.

Putting my suggestions together, here is some Python code for you to try out:
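A minimal sketch along those lines is below. It assumes the relaxed pattern from point 9 (strand sign, chr, a name, and two numbers separated by ..) is enough to identify the lines you care about. The names PAT, extract and convert are only illustrative, and the sample lines are made up, since I have not seen your real data.

```python
import re

# Compiled once and reused for every line (point 10).
# [^:]+ accepts any name between "chr" and the colon: 1, 11, X, ...
# Assumption: the +/- sign sits directly in front of "chr", as in the question.
PAT = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')


def extract(line):
    """Return 'sign name start end' for a matching line, or None otherwise."""
    m = PAT.search(line)  # search(), not findall(): one hit per line is enough
    if m is None:
        return None  # empty or non-matching lines are simply skipped
    return ' '.join(m.groups())  # groups() gives a tuple of strings


def convert(infile, outfile):
    """Copy the extracted fields of every matching line to outfile."""
    with open(infile) as in_f, open(outfile, 'w') as out_f:
        for line in in_f:
            result = extract(line)
            if result is not None:
                out_f.write(result + '\n')


# Hypothetical sample lines, including an empty one.
samples = [
    'id1\t1..100\t+chr1:100000..120000\tsome text\n',
    'id2\t1..100\t-chrX:5000..6000\n',
    '\n',
]
for s in samples:
    print(extract(s))
```

With your file names, the call would be convert('myfile.txt', 'outfile.txt').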
EDIT: Since you do seem to want to recognize the string chrX as valid, I changed the above example code. Instead of \d to match a digit, it now uses [^:] to match anything but a colon. The above code should match chr1:, chrX:, or pretty much anything else now.
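To see why the earlier patterns miss chrX, here is a small comparison on a made-up sample string (the helper groups_or_none is hypothetical, just for printing). It also shows a side effect of the trailing .+ from point 6: when nothing follows the second number, .+ takes its last digit.

```python
import re

# Hypothetical sample text; the real file layout may differ.
sample = '+chrX:100000..120000'

digits_only = re.compile(r'chr(\d+):(\d+)\.\.(\d+)')        # original idea
x_then_digits = re.compile(r'chrX(\d+):(\d+)\.\.(\d+)')     # the attempted change
not_colon = re.compile(r'chr([^:]+):(\d+)\.\.(\d+)')        # the edited version
greedy_tail = re.compile(r'chr([^:]+):(\d+)\.\.(\d+).+$')   # edited version plus .+$


def groups_or_none(pattern, text):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.groups() if m else None


print(groups_or_none(digits_only, sample))    # None: X is not a digit
print(groups_or_none(x_then_digits, sample))  # None: wants a digit between X and ':'
print(groups_or_none(not_colon, sample))      # ('X', '100000', '120000')
print(groups_or_none(greedy_tail, sample))    # ('X', '100000', '12000')
```

If the names can only ever be digits, X or Y (an assumption on my part), a tighter group such as (\d+|[XY]) would reject more junk than [^:]+ does.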