I have another problem with my data set. Basically, there is a list of genes with associated features, including position numbers (columns 3 and 4) and strand orientation (+ or -). I am trying to do a calculation with the positions to make them relative to the start codon TYPE (second column) for each gene, rather than to the entire genome (as they are now). The problem is that the calculation is only performed on the + STRAND sequences; the - STRAND sequences are not showing up in the output. Below is a sample of the data set, my code, the output, and what I’ve tried.
Here’s the data set:
GENE_ID TYPE POS1 POS2 STRAND
PITG_00002 start_codon 10520 10522 -
PITG_00002 stop_codon 10097 10099 -
PITG_00002 exon 10474 10522 -
PITG_00002 CDS 10474 10522 -
PITG_00002 exon 10171 10433 -
PITG_00002 CDS 10171 10433 -
PITG_00002 exon 10097 10114 -
PITG_00002 CDS 10100 10114 -
PITG_00003 start_codon 38775 38777 +
PITG_00003 stop_codon 39069 39071 +
PITG_00003 exon 38775 39071 +
PITG_00003 CDS 38775 39068 +
Here is the code:
import numpy
import pandas
import pandas as pd
import sys
sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')
groups = data.groupby(['STRAND', 'GENE_ID'])
corrected = []
for (direction, gene_name), group in groups:
    ##print direction,gene_name
    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
    if direction == '+':
        group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
        group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
    else:
        group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
        group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
    ##print group
    corrected.append(group)
Here is a sample of the output:
+ PITG_00003
GENE_ID TYPE POS1 POS2 STRAND POSA POSB
8 PITG_00003 start_codon 38775 38777 + 1 3
9 PITG_00003 stop_codon 39069 39071 + 295 297
10 PITG_00003 exon 38775 39071 + 1 297
11 PITG_00003 CDS 38775 39068 + 1 294
Previously I was getting an array ValueError (see "Tab delimited dataset ValueError truth of array with more than one element is ambiguous error"), but that has been taken care of. So next I tried running only this part:
import numpy
import pandas
import pandas as pd
import sys
##sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')#,
#converters={'STRAND': lambda s: s[0]})
groups = data.groupby(['STRAND', 'GENE_ID'])
corrected = []
for (direction, gene_name), group in groups:
    print direction,gene_name
And the output printed out all the GENE_IDs and their STRAND symbol (+ or -), and it did so for both the + and - sequences. So somewhere below that, it isn’t selecting any of the sequences with - in the STRAND column.
So I tried adding this to the original code:
    if direction == '+':
        group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
        group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
    elif direction == '-':
        group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
        group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
    else:
        break
    print group
    # put into the result array
    corrected.append(group)
And this is the very end of the output. It printed the first - and then froze for a while before ending:
+
GENE_ID TYPE POS1 POS2 STRAND POSA POSB
134991 PITG_23350 start_codon 161694 161696 + 516 518
134992 PITG_23350 stop_codon 162135 162137 + 957 959
134993 PITG_23350 exon 161179 162484 + 1 1306
134994 PITG_23350 CDS 161694 162134 + 516 956
-
These lines seem weird to me:
    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
The first, I’m guessing, is simply trying to check whether the group has a start codon marker. But that doesn’t make sense, for two reasons.
(1) If there’s only one start_codon entry and it’s the first, then the condition is actually false! The selected index then holds only the label 0, and a one-element array containing 0 evaluates as false.
Maybe you want any(group.TYPE == 'start_codon'), or (group.TYPE == 'start_codon').any(), or sum(group.TYPE == 'start_codon') == 1, or something? But that can’t be right either, because
(2) Your code only works if start_exon is set. If it isn’t, then it’ll either give a NameError or fall back on whatever value it happened to have last time, and you’ve got no guarantee that’s going to be in a sensible order.
If I simply use
    start_exon = group.index[group.TYPE=='exon'][0]
by itself, then I get values for every group. I have no idea if those values are meaningful, but it doesn’t seem to be skipping anything.
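Putting the two points above together, here is a minimal sketch of how the loop might look with an explicit boolean test and with start_exon always set before it is used. It is written for Python 3 and a recent pandas rather than the Python 2 of the question. The sample rows are copied from the question but parsed as whitespace-separated text instead of from the tab-delimited file. The function name relative_positions and the choice to skip genes that lack a start codon or an exon are my assumptions, and the arithmetic is kept exactly as in the question, so I am not claiming the resulting coordinates are biologically right.

```python
import io

import pandas as pd

# Sample rows copied from the question (whitespace-separated here, which is
# an assumption made for this sketch; the real file is tab-delimited).
SAMPLE = """GENE_ID TYPE POS1 POS2 STRAND
PITG_00002 start_codon 10520 10522 -
PITG_00002 stop_codon 10097 10099 -
PITG_00002 exon 10474 10522 -
PITG_00002 CDS 10474 10522 -
PITG_00002 exon 10171 10433 -
PITG_00002 CDS 10171 10433 -
PITG_00002 exon 10097 10114 -
PITG_00002 CDS 10100 10114 -
PITG_00003 start_codon 38775 38777 +
PITG_00003 stop_codon 39069 39071 +
PITG_00003 exon 38775 39071 +
PITG_00003 CDS 38775 39068 +
"""


def relative_positions(data):
    """Hypothetical helper: add POSA/POSB per gene using the question's formulas."""
    corrected = []
    for (direction, gene_name), group in data.groupby(['STRAND', 'GENE_ID']):
        # Explicit test for "has a start codon", instead of relying on the
        # truthiness of an index object.
        if not (group['TYPE'] == 'start_codon').any():
            continue  # assumption: genes without a start codon are skipped
        exon_rows = group[group['TYPE'] == 'exon'].index
        if len(exon_rows) == 0:
            continue  # assumption: genes without an exon row are skipped
        start_exon = exon_rows[0]  # always set before use in this iteration
        group = group.copy()  # work on a copy, not a view of the original frame
        if direction == '+':
            ref = group.loc[start_exon, 'POS1']
            group['POSA'] = 1 + (group['POS1'] - ref).abs()
            group['POSB'] = 1 + (group['POS2'] - ref).abs()
        else:
            ref = group.loc[start_exon, 'POS2']
            group['POSA'] = 1 - (group['POS2'] - ref).abs()
            group['POSB'] = 1 - (group['POS1'] - ref).abs()
        corrected.append(group)
    return pd.concat(corrected)


data = pd.read_csv(io.StringIO(SAMPLE), sep=r'\s+')
result = relative_positions(data)
print(result)
```

With the sample rows, both the + and the - gene appear in the result. Using continue rather than break matters here: break would end the whole loop at the first group that fails a check, while continue only skips that one group.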