The Archive Base

Editorial Team
Asked: June 15, 2026

For example, I have two scripts that check whether a multiple sequence alignment (MSA) has at least 50 columns with fewer than 50% gaps.

The first, using Biopython, takes 4.2 seconds on an MSA of 16281 sequences with 609 columns (PF00085 from Pfam, in FASTA format). [The __getitem__ method of Biopython’s MultipleSeqAlignment object consumes a lot of time.]

The second, which uses simple file I/O to build a 2D NumPy array from the MSA, takes only 1.2 seconds on the same alignment.

I think a NumPy approach to MSA objects could be more useful and faster. For example, you can use a boolean NumPy array to select specific rows and columns. Currently, deleting and selecting columns (for example, to eliminate columns with more than 50% gaps) is very time-consuming and not well implemented in Biopython. I think an n×3 NumPy array could be useful for PDB coordinates, too.
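As a minimal sketch of the boolean-mask idea, assuming the alignment is already held in a 2D NumPy array of single characters (rows are sequences, columns are alignment positions). The toy alignment and the function name drop_gappy_columns are illustrative only; they are not part of Biopython or NumPy:

```python
import numpy as np


def drop_gappy_columns(msa, max_gap_fraction=0.5):
    """Keep only the columns whose gap fraction is below max_gap_fraction.

    Returns the filtered array and the boolean column mask that was applied.
    """
    # Fraction of '-' characters in each column (axis 0 runs over sequences).
    gap_fraction = (msa == "-").mean(axis=0)
    keep = gap_fraction < max_gap_fraction
    # A boolean mask on the second axis selects whole columns at once.
    return msa[:, keep], keep


# Hypothetical toy alignment: 4 sequences, 5 columns.
msa = np.array([list(s) for s in ["AC-GT", "A--GT", "AC-G-", "A-TGT"]])
filtered, keep = drop_gappy_columns(msa)
print(keep)
print(["".join(row) for row in filtered])
```

The same mask can be reused to filter any per-column annotation, and a boolean mask over rows selects sequences in the same way (msa[row_mask, :]).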

I have five ideas; maybe only one or two of them are useful:

1 – Create Seq and multiple sequence alignment objects (Bio.Align.MultipleSeqAlignment) based on NumPy instead of str. This could be a problem for compatibility. Maybe it is NOT a good idea; I don’t know.

2 – Create a faster method in Biopython for obtaining NumPy array versions of Biopython objects. I tried to generate NumPy arrays from a MultipleSeqAlignment object, but this makes multiple calls to the __getitem__ method, and it is more time-consuming than using Biopython alone. Maybe someone with more programming skills can do something better.

3 – Create a module for NumPy or SciPy with I/O support for alignments and PDB files. Maybe the simplest and most useful idea.

4 – Create another complete Bio module, but based on NumPy. Maybe inside SciPy or NumPy.

5 – Like ideas 2 and 3: create modules and methods for faster and more efficient interoperability between Biopython and NumPy objects.
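One possible direction for ideas 2 and 3, sketched under the assumption that the sequences are already available as equal-length plain strings (for instance str(record.seq) for each Biopython record). Joining them and viewing the bytes with numpy.frombuffer avoids creating one Python object per residue. The name sequences_to_array is hypothetical, and the resulting array is read-only unless copied:

```python
import numpy as np


def sequences_to_array(sequences):
    """Build an (n_sequences, n_columns) array of single bytes from strings."""
    sequences = list(sequences)
    if not sequences:
        raise ValueError("no sequences given")
    n_columns = len(sequences[0])
    if any(len(s) != n_columns for s in sequences):
        raise ValueError("all sequences must have the same length")
    # View the concatenated ASCII bytes as one-byte strings, then reshape
    # so that each row is one sequence.
    flat = np.frombuffer("".join(sequences).encode("ascii"), dtype="S1")
    return flat.reshape(len(sequences), n_columns)


# Hypothetical toy input: 3 aligned sequences of length 5.
arr = sequences_to_array(["AC-GT", "A--GT", "AC-G-"])
# The cells are bytes, so comparisons use b"-" rather than "-".
gaps_per_column = (arr == b"-").sum(axis=0)
print(arr.shape, gaps_per_column)
```

Call arr.copy() if the array needs to be modified in place.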

What do you think? Which of the ideas is best? Do you have a better idea? Is it possible to do something like this? I want to collaborate with the Biopython project, and I think that integration with NumPy could be a good start.

Many thanks 😉

P.S.: My two scripts.
The slow one, based on Biopython:

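#!/usr/bin/env python3
# Print the file name if the alignment has at least 50 columns
# with fewer than 50% gaps.

from sys import argv
from Bio import AlignIO

aln = AlignIO.read(argv[1], "fasta")
longitud = aln.get_alignment_length()  # number of columns
if longitud > 150:
    corte = 0.5 * len(aln)  # gap cutoff: half the number of sequences
    j = 0  # columns below the cutoff found so far
    i = 0  # index of the current column
    while j < 50 and i < longitud:
        # aln[:, i] returns column i as a string
        if aln[:, i].count("-") < corte:
            j += 1
        i += 1
    if j >= 50:
        print(argv[1])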

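And the faster one, based on a NumPy array:

#!/usr/bin/env python3
# Same check as the Biopython script, on a 2D NumPy array of characters.

from sys import argv
import numpy as np

with open(argv[1], "r") as archivo:
    secuencias = []
    identificadores = []
    temp = ""
    for linea in archivo:
        if linea[0] == ">":
            identificadores.append(linea[1:].rstrip("\n"))
            secuencias.append(list(temp))
            temp = ""
        else:
            temp += linea.rstrip("\n")
    secuencias.append(list(temp))

# The first entry is the empty list appended before the first header.
sec = np.array(secuencias[1:])
ide = np.array(identificadores)

# Alignment length check, as in the Biopython script.
if sec.shape[1] > 150:
    corte = len(ide) * 0.5  # gap cutoff: half the number of sequences
    # Sum over axis 0 to count gaps per column (axis 1 would count per sequence).
    if np.sum(np.sum(sec == "-", axis=0) < corte) >= 50:
        print(argv[1])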

1 Answer

    Editorial Team
    Added an answer on June 15, 2026 at 1:33 am

    If you are going to be doing lots of operations on MSA objects where it is useful to treat them as arrays of characters, then I would just use Biopython’s AlignIO to load the alignment and then convert it into a NumPy array of characters. For example:

    import numpy
    from Bio import AlignIO

    filename = "opuntia.aln"
    file_format = "clustal"
    alignment = AlignIO.read(filename, file_format)
    # One row per record, one single-character cell per alignment column.
    align_array = numpy.array([list(rec) for rec in alignment], dtype="U1")
    

    That quick example could easily be added to the alignment object as a to_array method, or included in the tutorial. Is that helpful?
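A rough sketch of what such a helper might look like, written as a free function rather than a real Biopython method. The name to_array and the Record stand-in are assumptions for illustration; the only thing the function relies on is that each record exposes a .seq that converts to a string, as SeqRecord does:

```python
from collections import namedtuple

import numpy as np


def to_array(alignment):
    """Convert an iterable of records (each with a .seq) to a 2D character array."""
    return np.array([list(str(record.seq)) for record in alignment], dtype="U1")


# Stand-in for Bio.SeqRecord.SeqRecord so the sketch runs without an alignment file.
Record = namedtuple("Record", ["id", "seq"])
toy_alignment = [Record("a", "AC-GT"), Record("b", "A--GT")]

align_array = to_array(toy_alignment)
print(align_array.shape)  # (2, 5)
print(align_array[:, 1])  # second alignment column
```

With a real alignment loaded through AlignIO.read, the same function should accept the alignment object directly, since iterating over it yields SeqRecord objects.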

    Granted, you are still paying the overhead of all the object creation (Seq objects, SeqRecord objects, empty annotation dictionaries, the alignment object, etc.), but that is the downside of the AlignIO interface: it works on a relatively heavy object model. This isn’t really needed for simple formats like FASTA and Clustal, but it is more useful with rich alignment formats like Stockholm.
