This is my first script, and I am trying to compare two genome files,

Question

Editorial Team

Asked: May 30, 2026 (2026-05-30T06:44:04+00:00)

This is my first script, and I am trying to compare two genome files, one of which has more data points than the other.

The content of the files looks like this:

rs3094315       1       742429  AA
rs12562034      1       758311  GG
rs3934834       1       995669  CC

The fields are separated by tabs. There are about 500,000 lines in each file.

To compare them easily, I want to keep only the data points that both files contain and discard any data points unique to either one. To do this, I have created a list of all the DNA positions that appear in only one file, and now I am trying to go through each line of the original data file and print every line that does NOT contain one of these unique DNA positions to a new file.

Everything in my code has worked up until I try to search through the genome file using regex to print all non-unique DNA positions. I can get the script to print all items in the LaurelSNP_left list inside the for loop, but when I try to use re.match for each item, I get this error message:

Traceback (most recent call last):
  File "/Users/laurelhochstetler/scripts/identify_SNPs.py", line 57, in <module>
    if re.match(item,"(.*)", Line):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py", line 500, in compile
    p = sre_parse.parse(p, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 673, in parse
    p = _parse_sub(source, pattern, 0)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 308, in _parse_sub
    itemsappend(_parse(source, state))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 401, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'

My question is two-fold:

How can I use my list in a regular expression?
Is there a better way to accomplish what I am trying to do here?

Here’s my code:

#!/usr/bin/env python
import re #this imports regular expression module
import collections

MomGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Mary_Maloney_Full_20110514145353.txt', 'r')
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
LineNumber = 0 
momSNP = []
LaurelSNP = []
f = open("mom_edit.txt","w")
for Line in MomGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')

        momSNP.append(ElementList[0])

        LineNumber = LineNumber + 1
MomGenome.close()
for Line in LaurelGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')

        LaurelSNP.append(ElementList[0])

        LineNumber = LineNumber + 1
momSNP_multiset = collections.Counter(momSNP)            
LaurelSNP_multiset = collections.Counter(LaurelSNP)
overlap = list((momSNP_multiset and LaurelSNP_multiset).elements())
momSNP_left = list((momSNP_multiset - LaurelSNP_multiset).elements())
LaurelSNP_left = list((LaurelSNP_multiset - momSNP_multiset).elements())
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
i = 0
for Line in LaurelGenome:
    for item in LaurelSNP_left:
            if i < 1961:
                if re.match(item, Line):
                    pass

                else:
                    print Line

            i = i + 1
    LineNumber = LineNumber + 1

1 Answer

Editorial Team · Answer 1 · 2026-05-30T06:44:05+00:00

You want to print every line from file 2 whose ID also occurs in file 1. Make a set of the IDs in file 1, and test each ID against it as you loop through file 2:

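# Assumes MomGenome and LaurelGenome are freshly opened file objects.
momSNP = set()
for line in MomGenome:
    snp, rest = line.split(None, 1) # Split into two pieces only
    momSNP.add(snp)

for line in LaurelGenome:
    snp, rest = line.split(None, 1)
    if snp in momSNP:
        print line,  # trailing comma: the line already ends with a newline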
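
The loop above needs both file objects to be freshly opened, and `line.split(None, 1)` raises an error on a blank line. Note also that in the question's code the header skip `if LineNumber > 0:` never becomes true, because `LineNumber` is only incremented inside that `if`, so both ID lists stay empty. Below is a minimal, self-contained sketch of the same set-based filter. The helper names `snp_id` and `filter_shared` and the sample rows are made up for illustration, and skipping blank lines and lines starting with `#` is an assumption about the file layout, not something the question confirms.

```python
def snp_id(line):
    """Return the first whitespace-separated field (the SNP ID), or None."""
    stripped = line.strip()
    # Assumption: blank lines and '#' comment lines carry no data.
    if not stripped or stripped.startswith('#'):
        return None
    return stripped.split(None, 1)[0]


def filter_shared(reference_lines, target_lines):
    """Yield lines of target_lines whose SNP ID also appears in reference_lines."""
    reference_ids = set()
    for line in reference_lines:
        snp = snp_id(line)
        if snp is not None:
            reference_ids.add(snp)
    for line in target_lines:
        # snp_id() returns None for skipped lines, and None is never in the set.
        if snp_id(line) in reference_ids:
            yield line


# Made-up sample rows in the question's tab-separated layout.
mom_lines = [
    "rs3094315\t1\t742429\tAA\n",
    "rs3934834\t1\t995669\tCC\n",
]
laurel_lines = [
    "rs3094315\t1\t742429\tAG\n",
    "rs12562034\t1\t758311\tGG\n",
    "rs3934834\t1\t995669\tCC\n",
]

laurel_shared = list(filter_shared(mom_lines, laurel_lines))
for line in laurel_shared:
    print(line.rstrip('\n'))
```

With real data, the two lists can be replaced by open file objects, and each yielded line written to an output file. Calling `filter_shared` a second time with the arguments swapped gives the filtered copy of the other file, so both outputs contain only the shared IDs.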

The set only needs to hold the roughly 500,000 SNP IDs from one file, so memory should not be a problem, and each membership test takes constant time on average.
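
On the regex part of the question: the signature is `re.match(pattern, string, flags=0)`, so in `re.match(item,"(.*)", Line)` the third positional argument, `Line`, is taken as `flags`. The `re` module then evaluates `flags & SRE_FLAG_VERBOSE` on a string, which produces the `TypeError` in the traceback. If a regex is still wanted, one common approach is to escape each ID, join the IDs with `|`, and compile a single pattern. The sketch below uses made-up IDs in place of `LaurelSNP_left`. For exact ID comparison a set lookup is simpler and will usually be faster.

```python
import re

# Hypothetical IDs standing in for the question's LaurelSNP_left list.
unique_ids = ["rs12562034", "rs12"]

# Escape each ID, join with '|', and require whitespace after the ID so that
# "rs12" cannot match as a mere prefix of a longer ID such as "rs123".
pattern = re.compile(r"(?:%s)\s" % "|".join(re.escape(i) for i in unique_ids))

lines = [
    "rs3094315\t1\t742429\tAA",
    "rs12562034\t1\t758311\tGG",
    "rs3934834\t1\t995669\tCC",
]

# pattern.match() anchors at the start of each line; keep lines that do NOT match.
kept = [line for line in lines if not pattern.match(line)]
for line in kept:
    print(line)
```

Without the trailing `\s`, a bare `re.match(item, Line)` treats `item` as a prefix, so a short ID would also exclude every line whose ID merely starts with it.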
