This is my first script, and I am trying to compare two genome files, one of which has more data points than the other.
The content of the files looks like this:
rs3094315 1 742429 AA
rs12562034 1 758311 GG
rs3934834 1 995669 CC
The fields are separated by tabs, and each file has about 500,000 lines.
To compare them easily, I want to keep only the data points that both files contain and discard any data points unique to either one. To do this, I have created a list of all the DNA positions that are unique, and now I am trying to go through each line of the original data file and print every line NOT containing one of these unique DNA positions to a new file.
Everything in my code has worked up until I try to search through the genome file using regex to print all non-unique DNA positions. I can get the script to print all items in the LaurelSNP_left list inside the for loop, but when I try to use re.match for each item, I get this error message:
Traceback (most recent call last):
  File "/Users/laurelhochstetler/scripts/identify_SNPs.py", line 57, in <module>
    if re.match(item,"(.*)", Line):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py", line 500, in compile
    p = sre_parse.parse(p, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 673, in parse
    p = _parse_sub(source, pattern, 0)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 308, in _parse_sub
    itemsappend(_parse(source, state))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 401, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
My question is two-fold:
- How can I use my list in a regular expression?
- Is there a better way to accomplish what I am trying to do here?
Here’s my code:
#!/usr/bin/env python
import re #this imports regular expression module
import collections
MomGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Mary_Maloney_Full_20110514145353.txt', 'r')
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
LineNumber = 0
momSNP = []
LaurelSNP = []
f = open("mom_edit.txt","w")
for Line in MomGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')
        momSNP.append(ElementList[0])
    LineNumber = LineNumber + 1
MomGenome.close()
for Line in LaurelGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')
        LaurelSNP.append(ElementList[0])
    LineNumber = LineNumber + 1
momSNP_multiset = collections.Counter(momSNP)
LaurelSNP_multiset = collections.Counter(LaurelSNP)
overlap = list((momSNP_multiset and LaurelSNP_multiset).elements())
momSNP_left = list((momSNP_multiset - LaurelSNP_multiset).elements())
LaurelSNP_left = list((LaurelSNP_multiset - momSNP_multiset).elements())
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
i = 0
for Line in LaurelGenome:
    for item in LaurelSNP_left:
        if i < 1961:
            if re.match(item, Line):
                pass
            else:
                print Line
        i = i + 1
    LineNumber = LineNumber + 1
You want to print every line from file 2 whose ID also occurs in file 1, i.e. the SNPs that both files share. Build a set of the IDs in file 1, then test each line's ID against that set as you loop through file 2.
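A minimal sketch of that set-based approach follows. The function names `read_ids` and `filter_shared`, the `skip_header` parameter, and the file names in the usage comment are hypothetical, and the sketch assumes the ID is the first tab-separated field and that each file starts with one header line. It keeps the lines whose ID is in the set, which is what the question asks for; change `in` to `not in` to keep only the lines unique to file 2 instead.

```python
def read_ids(path, skip_header=True):
    """Return the set of first-column IDs from a tab-separated file."""
    ids = set()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if skip_header and line_number == 0:
                continue  # assumption: the first line is a header
            snp_id = line.rstrip("\n").split("\t", 1)[0]
            if snp_id:
                ids.add(snp_id)
    return ids


def filter_shared(path, known_ids, out_path, skip_header=True):
    """Copy lines whose ID is in known_ids to out_path; return how many were kept."""
    kept = 0
    with open(path) as src, open(out_path, "w") as dst:
        for line_number, line in enumerate(src):
            if skip_header and line_number == 0:
                continue
            snp_id = line.rstrip("\n").split("\t", 1)[0]
            if snp_id in known_ids:  # set lookup, no regex needed
                dst.write(line)
                kept += 1
    return kept


# Hypothetical usage (file names are placeholders):
#   mom_ids = read_ids("mom_genome.txt")
#   filter_shared("laurel_genome.txt", mom_ids, "laurel_shared.txt")
```

Each membership test against a set takes roughly constant time, so the whole job is one pass over each file rather than one pass over the list of unique IDs for every line.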
This only needs to hold the roughly 500,000 SNP IDs from one file in memory, so memory shouldn't be much of a problem.
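As for the regex question: the call shown in the traceback is `re.match(item,"(.*)", Line)`. `re.match` takes `(pattern, string, flags=0)`, so `Line` ends up in the `flags` position, and the `&` between that string and an integer flag raises the `TypeError`. If you do want a regex, one option is to join the escaped IDs into a single alternation and compile it once. This is only a sketch: `build_id_pattern` and `drop_matching` are hypothetical helper names, and it assumes a non-empty ID list and that every ID is followed by a tab.

```python
import re


def build_id_pattern(ids):
    """Compile one pattern matching a line that starts with any of the given IDs."""
    alternatives = "|".join(re.escape(snp_id) for snp_id in ids)
    # The trailing tab stops rs123 from also matching rs1234.
    return re.compile(r"(?:%s)\t" % alternatives)


def drop_matching(lines, ids):
    """Return the lines whose leading ID is NOT in ids."""
    pattern = build_id_pattern(ids)
    return [line for line in lines if not pattern.match(line)]


sample = [
    "rs3094315\t1\t742429\tAA",
    "rs12562034\t1\t758311\tGG",
    "rs3934834\t1\t995669\tCC",
]
remaining = drop_matching(sample, ["rs3094315", "rs3934834"])
```

With thousands of IDs the alternation gets large, so the set lookup is likely the simpler and faster choice here.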