The Archive Base

Editorial Team
Asked: June 17, 2026

I have another problem with my data set. It is a list of genes with associated features, including position numbers (columns 3 and 4) and strand orientation (+ or -). I am trying to recalculate the positions so that they are relative to the start codon (see the TYPE column, column 2) of each gene rather than to the entire genome, as they are now. The problem is that the calculation is only performed on the + strand sequences; the - strand sequences do not show up in the output. Below are a sample of the data set, my code, the output, and what I've tried.

Here’s the data set:

GENE_ID     TYPE         POS1   POS2   STRAND
PITG_00002  start_codon  10520  10522  -
PITG_00002  stop_codon   10097  10099  -
PITG_00002  exon         10474  10522  -
PITG_00002  CDS          10474  10522  -
PITG_00002  exon         10171  10433  -
PITG_00002  CDS          10171  10433  -
PITG_00002  exon         10097  10114  -
PITG_00002  CDS          10100  10114  -
PITG_00003  start_codon  38775  38777  +
PITG_00003  stop_codon   39069  39071  +
PITG_00003  exon         38775  39071  +
PITG_00003  CDS          38775  39068  +

Here is the code:

import numpy
import pandas
import pandas as pd
import sys

sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    ##print direction,gene_name
    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
    if direction == '+':
        group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
        group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
    else:
        group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
        group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
    ##print group
    corrected.append(group)

Here is a sample of the output:

     + PITG_00003
    GENE_ID     TYPE         POS1   POS2   STRAND  POSA  POSB
8   PITG_00003  start_codon  38775  38777  +       1     3   
9   PITG_00003  stop_codon   39069  39071  +       295   297 
10  PITG_00003  exon         38775  39071  +       1     297 
11  PITG_00003  CDS          38775  39068  +       1     294 

Previously I was getting a ValueError (see my earlier question, "Tab delimited dataset ValueError truth of array with more than one element is ambiguous error"), but that has been taken care of. Next I tried running only this part:

import numpy
import pandas
import pandas as pd
import sys

##sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')#,
              #converters={'STRAND': lambda s: s[0]})
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    print direction,gene_name

The output listed every GENE_ID with its STRAND symbol (+ or -), for both the + and - sequences. So somewhere below that point, the code is not selecting any of the sequences with - in the STRAND column.

So I tried adding this to the original code:

if direction == '+':
    group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
    group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
elif direction == '-':
    group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
    group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
else:
    break
print group
# put into the result array
corrected.append(group)

This is the very end of the output. It printed the first - and then froze for a while before ending:

+
        GENE_ID     TYPE         POS1    POS2    STRAND  POSA  POSB
134991  PITG_23350  start_codon  161694  161696  +       516   518 
134992  PITG_23350  stop_codon   162135  162137  +       957   959 
134993  PITG_23350  exon         161179  162484  +       1     1306
134994  PITG_23350  CDS          161694  162134  +       516   956 
-

1 Answer

    Editorial Team
    Added an answer on June 17, 2026 at 11:33 pm

    These lines seem weird to me:

    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
    

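    The first line, I'm guessing, is trying to check whether the group has a start codon marker. But that doesn't work, for two reasons.

    (1) If there's only one start_codon entry and it's the first row, then the condition is actually false! The index that comes back holds the single label 0, and truth-testing a one-element array tests that element, so a hit at label 0 looks like "nothing found":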

    In [8]: group.TYPE == 'start_codon'
    Out[8]: 
    0     True
    1    False
    2    False
    3    False
    4    False
    5    False
    6    False
    7    False
    Name: TYPE
    
    In [9]: group.index[group.TYPE == 'start_codon']
    Out[9]: Int64Index([0], dtype=int64)
    
    In [10]: bool(group.index[group.TYPE == 'start_codon'])
    Out[10]: False
    

    Maybe you want any(group.TYPE == 'start_codon'), or (group.TYPE == 'start_codon').any() or sum(group.TYPE == 'start_codon') == 1 or something? But that can’t be right either, because

    (2) Your code only works if start_exon is set. If it isn't, then it'll either give a NameError (on the first group) or fall back on whatever value it had for a previous group, and there's no guarantee that stale label even exists in the current group.
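
A minimal sketch of a check that avoids both pitfalls, assuming a small hand-made `group` frame in place of one real group from the question (the frame, `has_start_codon`, and `exon_labels` are illustrative names, not part of the original code). It asks the boolean mask itself whether anything matched, and resets `start_exon` for every group so an old value cannot leak in. As far as I know, recent pandas versions raise a ValueError when an Index is truth-tested directly, rather than returning False as in the session above, so the explicit `.any()` is needed either way.

```python
import pandas as pd

# Hypothetical stand-in for one group from the question's data.
group = pd.DataFrame({
    "TYPE": ["start_codon", "stop_codon", "exon", "CDS"],
    "POS1": [10520, 10097, 10474, 10474],
    "POS2": [10522, 10099, 10522, 10522],
})

# Boolean mask: True on rows that are start codons.
is_start = group["TYPE"] == "start_codon"

# Ask the mask whether anything matched, instead of truth-testing labels.
has_start_codon = bool(is_start.any())

# The matching label is 0, which is a valid hit but falsy as a number.
start_labels = list(group.index[is_start])

# Reset per group so a value from an earlier group is never reused.
start_exon = None
exon_labels = group.index[group["TYPE"] == "exon"]
if has_start_codon and len(exon_labels) > 0:
    start_exon = exon_labels[0]

print(has_start_codon, start_labels, start_exon)
```

If `start_exon` is still `None` after this, the group can be skipped or reported explicitly instead of being processed with the wrong anchor.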

    If I simply use start_exon = group.index[group.TYPE=='exon'][0] by itself, then I get

    In [28]: for c in corrected:
       ....:     print c
       ....:     
           GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
    8   PITG_00003  start_codon  38775  38777      +     1     3
    9   PITG_00003   stop_codon  39069  39071      +   295   297
    10  PITG_00003         exon  38775  39071      +     1   297
    11  PITG_00003          CDS  38775  39068      +     1   294
          GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
    0  PITG_00002  start_codon  10520  10522      -     1    -1
    1  PITG_00002   stop_codon  10097  10099      -  -422  -424
    2  PITG_00002         exon  10474  10522      -     1   -47
    3  PITG_00002          CDS  10474  10522      -     1   -47
    4  PITG_00002         exon  10171  10433      -   -88  -350
    5  PITG_00002          CDS  10171  10433      -   -88  -350
    6  PITG_00002         exon  10097  10114      -  -407  -424
    7  PITG_00002          CDS  10100  10114      -  -407  -421
    

    I have no idea if those values are meaningful, but it doesn’t seem to be skipping anything.
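
For reference, here is a sketch of the whole loop with the conditional removed, written for Python 3 and a current pandas. It assumes the twelve sample rows from the question, loaded from an in-memory string instead of `pinfestans-edited2.csv`; the names `SAMPLE`, `anchor`, and `result` are mine. It keeps the question's formulas as they are, anchored on the first exon row of each gene, so it says nothing about whether those formulas are the biologically right ones.

```python
import io
import pandas as pd

# The sample rows from the question, tab-separated like the real file.
SAMPLE = """GENE_ID\tTYPE\tPOS1\tPOS2\tSTRAND
PITG_00002\tstart_codon\t10520\t10522\t-
PITG_00002\tstop_codon\t10097\t10099\t-
PITG_00002\texon\t10474\t10522\t-
PITG_00002\tCDS\t10474\t10522\t-
PITG_00002\texon\t10171\t10433\t-
PITG_00002\tCDS\t10171\t10433\t-
PITG_00002\texon\t10097\t10114\t-
PITG_00002\tCDS\t10100\t10114\t-
PITG_00003\tstart_codon\t38775\t38777\t+
PITG_00003\tstop_codon\t39069\t39071\t+
PITG_00003\texon\t38775\t39071\t+
PITG_00003\tCDS\t38775\t39068\t+
"""

data = pd.read_csv(io.StringIO(SAMPLE), sep="\t")

corrected = []
for (direction, gene_name), group in data.groupby(["STRAND", "GENE_ID"]):
    group = group.copy()  # work on a copy, not a view of `data`
    exon_rows = group[group["TYPE"] == "exon"]
    if exon_rows.empty:
        continue  # no exon to anchor on; skip explicitly
    first_exon = exon_rows.iloc[0]
    if direction == "+":
        anchor = first_exon["POS1"]
        group["POSA"] = 1 + (group["POS1"] - anchor).abs()
        group["POSB"] = 1 + (group["POS2"] - anchor).abs()
    else:
        anchor = first_exon["POS2"]
        group["POSA"] = 1 - (group["POS2"] - anchor).abs()
        group["POSB"] = 1 - (group["POS1"] - anchor).abs()
    corrected.append(group)

# Put the groups back together in the original row order.
result = pd.concat(corrected).sort_index()
print(result)
```

The `group.copy()` call is there because assigning new columns to a slice handed out by `groupby` can trigger pandas' chained-assignment warning; copying first keeps the intent unambiguous.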
