I have another problem with my data set. Basically, there is a list of

Question

Asked: June 17, 2026

I have another problem with my data set. Basically, there is a list of genes with associated features, including position numbers (columns 3 and 4) and strand orientation (+ or -). I am trying to recalculate the positions so that they are relative to the start codon (TYPE, second column) of each gene rather than to the entire genome, as they are now. The problem is that the calculation is only performed on the + STRAND sequences; the - STRAND sequences do not show up in the output. Below is a sample of the data set, my code, the output, and what I’ve tried.

Here’s the data set:

    GENE_ID TYPE    POS1    POS2    STRAND
PITG_00002  start_codon 10520   10522   -
PITG_00002  stop_codon  10097   10099   -
PITG_00002  exon    10474   10522   -
PITG_00002  CDS 10474   10522   -
PITG_00002  exon    10171   10433   -
PITG_00002  CDS 10171   10433   -
PITG_00002  exon    10097   10114   -
PITG_00002  CDS 10100   10114   -
PITG_00003  start_codon 38775   38777   +
PITG_00003  stop_codon  39069   39071   +
PITG_00003  exon    38775   39071   +
PITG_00003  CDS 38775   39068   +

Here is the code:

import numpy
import pandas
import pandas as pd
import sys

sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    ##print direction,gene_name
    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
    if direction == '+':
        group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
        group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
    else:
        group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
        group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
    ##print group
    corrected.append(group)

Here is a sample of the output:

     + PITG_00003
    GENE_ID     TYPE         POS1   POS2   STRAND  POSA  POSB
8   PITG_00003  start_codon  38775  38777  +       1     3   
9   PITG_00003  stop_codon   39069  39071  +       295   297 
10  PITG_00003  exon         38775  39071  +       1     297 
11  PITG_00003  CDS          38775  39068  +       1     294

Previously I was getting an array value error (Tab delimited dataset ValueError truth of array with more than one element is ambiguous error) but that has been taken care of. So next I tried only doing this part:

import numpy
import pandas
import pandas as pd
import sys

##sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')#,
              #converters={'STRAND': lambda s: s[0]})
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    print direction,gene_name

The output listed all the GENE_IDs with their STRAND symbol (+ or -), and it did so for both the + and - sequences. So somewhere below that point, the code isn’t selecting any of the sequences with - in the STRAND column.

So I tried adding this to the original code:

if direction == '+':
    group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
    group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
elif direction == '-':
    group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
    group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
else:
    break
print group
# put into the result array
corrected.append(group)

This is the very end of the output. It printed the first - and then froze for a while before ending:

+
        GENE_ID     TYPE         POS1    POS2    STRAND  POSA  POSB
134991  PITG_23350  start_codon  161694  161696  +       516   518 
134992  PITG_23350  stop_codon   162135  162137  +       957   959 
134993  PITG_23350  exon         161179  162484  +       1     1306
134994  PITG_23350  CDS          161694  162134  +       516   956 
-

1 Answer

Editorial Team · Answer 1 · 2026-06-17T23:33:55+00:00

These lines seem weird to me:

if group.index[group.TYPE=='start_codon']:
    start_exon = group.index[group.TYPE=='exon'][0]

The first, I’m guessing, is simply trying to check to see whether the group has a start codon marker. But that doesn’t make sense for two reasons.

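(1) If there’s only one start_codon entry and it’s the first row (index label 0), then the condition is actually false, because the one-element index is evaluated by the truth value of its single label, 0: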

In [8]: group.TYPE == 'start_codon'
Out[8]: 
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: TYPE

In [9]: group.index[group.TYPE == 'start_codon']
Out[9]: Int64Index([0], dtype=int64)

In [10]: bool(group.index[group.TYPE == 'start_codon'])
Out[10]: False

Maybe you want any(group.TYPE == 'start_codon'), or (group.TYPE == 'start_codon').any() or sum(group.TYPE == 'start_codon') == 1 or something? But that can’t be right either, because

(2) Your code only works if start_exon is set. If it isn’t, then it’ll either give a NameError or fall back on whatever value it happened to be last time, and you’ve got no guarantee that’s going to be in a sensible order.
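As a rough sketch of an explicit check, the snippet below tests for a start codon with `.any()` and returns nothing at all when no anchor row can be found, so a stale `start_exon` from a previous group is never reused. It is written for Python 3 and a current pandas, where (as far as I know) evaluating an index as a boolean raises a ValueError rather than returning False. The helper name `first_exon_label` and the small `group` frame are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one group; the rows are taken from the sample data.
group = pd.DataFrame({
    "TYPE": ["start_codon", "stop_codon", "exon", "CDS"],
    "POS1": [10520, 10097, 10474, 10474],
    "POS2": [10522, 10099, 10522, 10522],
})


def first_exon_label(group):
    """Return the index label of the first exon row, or None if the group
    has no start_codon row or no exon row."""
    has_start = (group["TYPE"] == "start_codon").any()
    exon_labels = group.index[group["TYPE"] == "exon"]
    if not has_start or len(exon_labels) == 0:
        return None
    return exon_labels[0]


# Label of the first exon row in the full group.
start_exon = first_exon_label(group)

# The same group with its start_codon row removed yields None.
no_start = first_exon_label(group[group["TYPE"] != "start_codon"])
```

Returning `None` forces the caller to decide what to do with a gene that has no usable anchor, instead of silently carrying over the previous gene's value.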

If I simply use start_exon = group.index[group.TYPE=='exon'][0] by itself, then I get

In [28]: for c in corrected:
   ....:     print c
   ....:     
       GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
8   PITG_00003  start_codon  38775  38777      +     1     3
9   PITG_00003   stop_codon  39069  39071      +   295   297
10  PITG_00003         exon  38775  39071      +     1   297
11  PITG_00003          CDS  38775  39068      +     1   294
      GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
0  PITG_00002  start_codon  10520  10522      -     1    -1
1  PITG_00002   stop_codon  10097  10099      -  -422  -424
2  PITG_00002         exon  10474  10522      -     1   -47
3  PITG_00002          CDS  10474  10522      -     1   -47
4  PITG_00002         exon  10171  10433      -   -88  -350
5  PITG_00002          CDS  10171  10433      -   -88  -350
6  PITG_00002         exon  10097  10114      -  -407  -424
7  PITG_00002          CDS  10100  10114      -  -407  -421

I have no idea if those values are meaningful, but it doesn’t seem to be skipping anything.
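For reference, here is a minimal sketch of the whole loop with the `if` check dropped, rewritten for Python 3 and a current pandas. It assumes the sample rows from the question (embedded here as whitespace-separated text rather than read from the real file), and the names `SAMPLE` and `relative_positions` are made up for this example. Groups without any exon row are skipped, which is my own choice and not something the question specifies.

```python
import io

import pandas as pd

# Sample rows from the question, whitespace-separated for this sketch.
SAMPLE = """GENE_ID TYPE POS1 POS2 STRAND
PITG_00002 start_codon 10520 10522 -
PITG_00002 stop_codon 10097 10099 -
PITG_00002 exon 10474 10522 -
PITG_00002 CDS 10474 10522 -
PITG_00002 exon 10171 10433 -
PITG_00002 CDS 10171 10433 -
PITG_00002 exon 10097 10114 -
PITG_00002 CDS 10100 10114 -
PITG_00003 start_codon 38775 38777 +
PITG_00003 stop_codon 39069 39071 +
PITG_00003 exon 38775 39071 +
PITG_00003 CDS 38775 39068 +
"""


def relative_positions(data):
    """Add POSA/POSB columns relative to the first exon row of each gene."""
    corrected = []
    for (direction, gene_name), group in data.groupby(["STRAND", "GENE_ID"]):
        group = group.copy()  # work on a copy, not on a slice of `data`
        exon_labels = group.index[group["TYPE"] == "exon"]
        if len(exon_labels) == 0:
            continue  # assumption: skip genes that have no exon row
        start_exon = exon_labels[0]
        if direction == "+":
            anchor = group.loc[start_exon, "POS1"]
            group["POSA"] = 1 + (group["POS1"] - anchor).abs()
            group["POSB"] = 1 + (group["POS2"] - anchor).abs()
        else:
            anchor = group.loc[start_exon, "POS2"]
            group["POSA"] = 1 - (group["POS2"] - anchor).abs()
            group["POSB"] = 1 - (group["POS1"] - anchor).abs()
        corrected.append(group)
    # Restore the original row order after grouping.
    return pd.concat(corrected).sort_index()


data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
result = relative_positions(data)
```

On the sample rows this gives the same POSA and POSB values as the output above, for both strands.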
