The Archive Base

Editorial Team
Asked: June 6, 2026

I have written a Perl script for the following bioinformatics question, but unfortunately there


I have written a Perl script for the following bioinformatics question, but unfortunately there is a problem with the output.

Question

1) From a file of 40,000 unique sequences, unique meaning the sequence id numbers, extract the following pattern

 $gpat = [G]{3,5}; $npat = [A-Z]{1,25};
 $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

2) For each sequence, find if $pattern occurs between the values of

  • 0-100
  • 100-200
  • 200-300
  • …
  • 900-1000
  • >1000

Even if a sequence is shorter than 1000 characters, the same divisions must be maintained, i.e. 0-100, 100-200, etc.
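
To make the binning rule concrete, here is a minimal sketch, assuming fixed 100-character bins (0-99, 100-199, ..., 900-999) plus one catch-all bin for offsets of 1000 and above. The helper name `bin_index` and the toy sequence are made up for illustration; only `$gpat`, `$npat` and `$pattern` come from the question.

```perl
use strict;
use warnings;

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = $gpat . $npat . $gpat . $npat . $gpat . $npat . $gpat;
my $regex   = qr/$pattern/i;

# Hypothetical helper: map a 0-based match offset to a bin number.
# Bins 0..9 cover 0-99, 100-199, ..., 900-999; bin 10 catches 1000 and above.
sub bin_index {
    my ($place) = @_;
    my $i = int($place / 100);
    return $i > 10 ? 10 : $i;
}

# Toy sequence (not taken from small.fa): 150 filler bases, then one match.
my $toy = ('a' x 150) . 'gggtgggtgggtggg';

my @found;
while ($toy =~ /$regex/g) {
    push @found, bin_index($-[0]);    # $-[0] is the offset where the match starts
}
print "bins hit: @found\n";
```

Because the bin depends only on the match offset, the same function applies to sequences shorter than 1000 characters; their later bins simply stay at zero.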

The Issue

The main issue I am having is with counting the number of times $pattern occurs for each sequence subdivision and then adding its count for all the sequences.

For example, say $pattern occurs 5 times beyond position 1000 in sequence 1 and 3 times beyond position 1000 in sequence 2. The total count should then be 5+3 = 8.

Instead, my result comes out as (5+4+3+2+1) + (3+2+1) = 21, i.e. a cumulative total.

I am facing the same issue with the count for the first 10 subdivisions of 100 characters each.

I would be grateful if correct code could be provided for this calculation.

The code I have written is below. It is heavily derived from Borodin’s answer to one of my previous questions here: Perl: Search a pattern across array elements

His answer is here: https://stackoverflow.com/a/11206399/1468737

The Code :

use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
my $regex = qr/$pattern/i;

open my $fh, '<', 'small.fa' or die $!;

my ($id, $seq); 
my @totals = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0); # initialize the @totals array;
# it should contain 10 parts for the 10 divisions up to 1000bp
my @thousandcounts =(0); # counting total occurrences of $pattern at >1000 length

while (<$fh>) {
  chomp;

  if (/^>(\w+)/) {
    process_seq($seq) if $id;
    $id = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    process_seq($seq) if eof;
  }
}

print "Totals : @totals\n";
print "Thousand Counts total : @thousandcounts\n";

##**SUBROUTINE**    

sub process_seq {

  my $sequence = shift @_;   
  my $subseq = substr $sequence,0,1000;
  my $length = length $subseq;
  print $length,"\n";

  if ($length eq 1000) {

  my @offsets = map {sprintf '%.0f', $length * $_/ 10} 1..10;
  print "Offsets of 10 divisions: @offsets\n";

  my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
  my @count = (0); 

     while ($sequence =~ /$regex/g) {
     my $place = $-[0];
     print $place,"\n\n"; 

        if ($place <=1000){
        for my $i (0..9) { 
        next if $place >= $offsets[$i];                   
        $counts[$i]++;                                    
        last;
        }       

     }
      print "Counts : @counts\n\n";

      $totals[$_] += $counts[$_] for 0..9; 



        if ($place >1000){

        for my $i(0){
        $count[$i]++;
        last;
        }




        } print "Count greater than 1000 : @count\n\n"; 

         $thousandcounts[$_] += $count[$_] for 0;


  } 

} 

   #This region of code is for those sequences whose total length is less than 1000
   #It is working great ! No issues here
   elsif ($length != 1000) {

    my $substr = join ' ', unpack '(A100)*', $sequence;

    my @offsets = map {sprintf '%.0f', $length * $_/ ($length/100)} 1..10;
    print "Offsets of 10 divisions: @offsets\n";

    my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0,);

       while ($sequence =~ /$regex/g) {
       my $place = $-[0];
       print "Place : $place","\n\n"; 

         for my $i (0..9) { 
         next if $place >= $offsets[$i];                   
         $counts[$i]++;
         last;
        }
      }
       print "Counts : @counts\n\n";

       $totals[$_] += $counts[$_] for 0..9;

  }


}#subroutine ends

I am also attaching a small segment of the file I am working with. It is titled small.fa, and I have been experimenting only with this file before moving on to the bigger file containing >40,000 sequences.

>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NR_002714 1
gttatacatctctaccattacctagcctgaaaagccacctcagattcagc
caacaagtaagtgggcattacaggagaagggtacctttcacaagggctgt
aatctaaaatcttggggaagatacagcgtcatctgtccaagaggtgtcag
cagtaacgaagcctcagtagaagccaaagttattttggattactgagcct
gtatagtttccagattctcaagagaaatatatgggaatgtagatatctca
gaggaccttcctgctgtcaggaattcagaggaggaaataaggaaggtaat
aggtgctctgctctcattctctcaaaccctcttccctgtgttttcctata
gagattgctgatttgctccttaagcaagagattcactgctgctcagcatg
gctcagaccaactcatgcttcatgctgatctcctgcctgatgttcctgtc
tctgagccaaggtgagattgttttccccacacatacctcccacaacccca
gccctgaagccctcactctatcctcatgcatatgagttcacttgagaaaa
agcagagtcaagttcaggggttgttttgtgttgttcagtgatatttattg
ctgatctcatcccattcaaaaacatcctgacctccctaaggagttagaga
tggaacttagcataaccctttatcagtgaccactgcagttggcattggtt
tgtcatattaacactactcatgatgggggtgttgaggatgtctgtttgta
gacagtcattagtggaatggggaactgaggggagctttgtgtgtagagaa
actggacaggcttgagaaagaagcctcagtccttcaaggaagaaaaagcc
ataagtaaaagggacaatggggacacttttcatgagcctattcattgtgt
gctcttgtcttgagcaaagacatcttgagagcctataggtaagatgcaga
agggcagaagtgaccaatcgcttcgtgacctataggatccttctattcct
ataaagaatcctcagaagctcctacctcatattttagcctttaccttgcc
ctgagggtctttcttaattgtctctcttttcccaggacaggaggcccatg
ctgagttgcccaaggcccagatcagctgcccagaaggcaccagtgcctaa
ggctcccactgctactactttaatgaagagcatgagacctgggtttatgc
agatgtgagtgaggagagcagtgtgggaagggaggctcacgaagggaggg
gaagctgccactctccagtgtgttcagtggctgatatgagatgagactaa
tcccctccctatccaatcatcagcccaaaactttccaatctactttatcc
catcattcagcacagagatgctggtggtcagtgacagcatcatcagggac
atttctgtgctgtcctttttctgttacatcctctgggagggctcaatatg
tctcccacactttcctccttcactgagtgctccattttcttctccaacag
ctctactgccagaacatgaattcaggtaacctggtgtctgtgctcaccca
ggctgagggtgcctttgtggcttcgctgattaaagagagtggcaccaagg
atagcaatgtctggattggcctccatgacccccaccggatcagtctgctg
catcttctacctcctgattatcaggttccagagggtctgatgtctggcac
ctcaagcatcagtttttactatattatgataaaagcaacctctctataaa
tcatataatgtaaaggatatcaaggttctccataggttcttcgagataag
cttaaagctgaatttcctgtgtgtttcaggcattcacagataaactcatt
ctctgtacttctagggtagcatctttatgtatctattatgtacctcttat
ctattgtgttatcatctctgttatagaagagccttctgtagaccatatag
aaaaagattatagaggaggagaatctactgctggcaattgggaaccgcaa
ggtatactaaataatatatcaacaactaatggccatctaatgctatgctg
gatatgaacttttggggcctcaggaaagaaaaaccaggaactagtttcaa
taatgaggtgtcatggttccctgtggcaaatttagaacgcttatcgtttg
gcaggacacagagaggtaggtgaacattccaggaaagaagcagcttagag
aaaatgtggaggaaataatatgacacttagagaaaaaggaaggtttattc
ttgtcttatgtcttgacctgtttctgagtgcgaacacaaaccaggtgttt
ctgtctctttctgagtcacgtctgcccctgttctggcccttccccatcta
gaactgccattatcagtggagtagtgggtccctggtctcctacaaatcct
gggacattggatccccaagctgtgccaatactgcctactgtgctagcctg
acttcaagctcaggtgaggggcacagaatccacacacttattgccatcct
ctcctatttatctctgaggatcgaccggggactgggatagaggaagggtg
agctcctcattcaggaaatagaggagtgtttcctctttatttttgctgag
tcctgcagccaggagggtaatacactctgatcccctcagtctgaatcttc
tcattgtcttataggattcaagaaatggaaggatgattcttgtaaggaga
agttctcctttgtttgcaagttcaaatactggaggcaattgtaaaatgga
cgtctagaattggtctaccagttactatggagtaaaagaattaaactgga
ccatctctctccatatcaatctggaccatctctcctctgctaaatttgca
tgactgatctttagtatctttacctacctcaatttctggagccctaaaca
ataaaaataaacatgtttcccccat
>NR_003569 1
ctgggacccacgacgacagaaggcgccgatggccgcgcctgctgagccct
gcgcggggcagggggtctggaaccagacagagcctgaacctgccgccacc
agcctgctgagcctgtgcttcctgagaacagcaggggtctgggtaccccc
catgtacctctgggtccttggtcccatctacctcctcttcatccaccacc
atggccggggctacctccggatgttccccactcttcaaagccaagatggt
gcttggattcgccctcatagtcctgtgtacctccagcgtggctgtcgctc
tttggaaaatccaacagggaacgcctgaggccccagaattcctcattcat
cctactgtgtggctcaccacgatgagcttcgcagtgttcctgattcacac
caagaggaaaaagggagtccagtcatctggagtgctgtttggttactggc
ttctctgctttgtcttgccagctaccaacgctgcccagcaggcctccgga
gcgggcttccagagcgaccctgtccgccacctgtccacctacctatgcct
gtctctggtggtggcacagtttgtgctgtcctgcctggcggatcaacccc
ccttcttccctgaagacccccagcagtctaacccctgtccagagactggg
gcagccttcccctccaaagccacgttctggtgggtttctggcctggtctg
gaggggatacaggaggccactgagaccaaaagacctctggtcgcttggga
gagaaaactcctcagaagaacttgtttcccggcttgaaaaggagtggatg
aggaaccgcagtgcagcccgggggcacaacaaggcaatagcatttaaaag
gaaaggcggcagtggcatggaggctccagagactgagcccttcctacggc
aagaagggagccagtggcgcccactgctgaaggccatctggcaggtgttc
cattctaccttcctcctggggaccctcagcctcgtcatcagtgatgtctt
caggttcactgtccccaagctgctcagccttttcctggagtttattggtg
atcccaagcctccagcctggaagggctacctcctcgccgtgctgatgttc
ctctcggcctgcctgcaaacgctgtttgagcagcagaacatgtacaggct
caaggtgctgtagatgaggctgcggtcggccatcactggcctggtgtaca
gaaaggcatccacagcatatctgaagaaatattcagaagttaactaatct
cagatgatttcagcaggagtaaagaagagaaacagactcagaaatgccat
tacaacagttaattatgtcaaatttatcaccctgattgatcacgcagcat
taacctcaagaacgccaagccaagtttttttgacaaatgtgagccaaggt
ttccgaaaaactagcagatatgactgtgacttacaaaatggaaaaagtaa
acgagaaacacaatttgatatgatttaataaaagatttgtttccaccact
tctcctgggaacctcagcacattttctttccactgacagttattatctct
acctttattgaacaaagacacccggaacacagctgctgaggatcagtaaa
gaaaatcattcttttattaataagactgttattagcaggaaaaaaaaatc
catgtttgggagtttgcactgaagttacaggccattttgaagaaatatgg
ctgactagtgccaacattatttcaggcaatttcatgatcaaatgtcttat
taggttgtttaaaatttttatagagattgtaaatcagaactattttctat
ttgccctaaatatttagatgctacagggaaagcagatcaaattaaagggt
actgtgcacatttttttactgggaactcccagggatataaatcatttcgc
ctgcagcatggaattcttcagtacacatgcttgtggaaacattccacgct
ccgccagcacgctcattaaagtgatgatttgggttgcaacaacagtgcca
agtacttcctgtgttcaactggggaccatgtggcaagacccaaagcttcc
ccagagatcctatgggaataagttttttgagccaccatattccattattt
cagcctaaaataacaccatgggacaagaatcagaagacagaggagcagac
aaatgtgtgtagacatgctggaaggaatctttctttttagaaacagggtc
aatatctattaaactttaagatgtgtatctcttgacctggcagtttctgt
atttgagttttaacctactgatatacccatgcatgtgaataaagtatctt
cctgcatgtaacaggatatttaatgtaaccttgattatagttgcaaatgc
tgggaaacgatccaaatgtctttcaatatggcactgattaaataaattat
ggcacagtctcacaatgaaaaacaaatgtagccattaaacagaatgaaat
gggtctagctaaattgaaataggactacctctaagatatgttgttaaaaa
gaaaaaaaagaaagtgcagaggaacaagtatgataccattttgtattttt
taacatatgcaagcgtgattgtgcccacacagaatacctttgaaaataaa
ctcagtatttgcctcagtggataaaaacaagaaccagccttattttcact
gttatatcttttggtgccactttttgaactttttaccatatgtgcatatg
taactttctaaataaattttgtaaaaaaaaaaaaaaaaaa
>NR_002817 2
aactcggtctccactgcactgctggccagacgagggatgttattttgggc
agtgcatctggacttggttcaagtggcaccagccaaatccctgccttact
gacctctcccctggaggagcaggagcagtgctcaaggccgccctgggagg
gctgagaggcaggctctggactggggacacagggatagctgagccccagc
tgggggtggaagctgagccagggacagtcacagaggaacaagatcaagat
gcgctttaactgagaagcccccaaggcagaggctgagaatcagaagacat
ttcagcagacatctacaaatctgaaggacaaaacatggttcaagcatctg
ggcacaggcggtccacccgtggctccaaaatggtctcctggtccgtgata
gcaaagatccaggaaatatggtgcgaggaagatgagaggaagatggcgcg
agagttcctggccgagttcatgagcacatatgtcatgatggagtggctga
ccgggatgctccagctgtgtctcttcgccatcgtggaccaggagaacaac
ccagcactgccaggaacacacgcactggtgataggcatcctcgtggtcat
catcagggtgtaccatggcatgaacacaggatatgccatcaatccgtccc
gggacctgccccccccccccgcatcttcaccttcattgctggttggggca
aactggtcttcaggtactgcccctgcccaggcccattcctttgagatttt
ctgtggggcccctgtgtgttgaggtgtggggggtgatgtgaggggcagca
caggagggtcctgcagagcccccaggtggcctggggagcaggagtgagtc
ccaacatttccccaggccagtagagatacagatcctgcacctgcactgag
tgtcaaccctgtccctgagtcgggctgaggctgaccagggccccgggttg
ggggtgtttcctgggttagcctgaggatgactcctctgctcaaccagtct
tggcccgaggtggatgagggtgctgtcctgggcatcagccccctcagccg
gcctctgcctcttgcctgcagcgatggggagaacttgtggtgggtgccag
tggtggcaccacttctgggtgcctctctaggtggcatcatctacctggtc
ttcattggctccaccatcccacgggagcccctgaaattggaggactctgt
ggcatatgaagaccacgggataaccgtattgcccaagatgggatctcatg
aacccatgatctctccccttaccctcatctccgtgagccctgccaacaga
tcttcagtccaccctgccccacccttacatgaatccatggccctagagca
cttctaagcagagattatttgtgatcccatcccttccccaataaagagaa
gcttgtcccacagcagtacccccacttcctgggggcctcctgtggttggg
cttccctcctgggttcttccaggagctctagggctatgtcttagcccaag
gtgtagaggtgaggcacctcaagtctttcatgccctgggaactggggtgc
cccagggggagaatggggaagagctgacctgcgccctcagtaggaacaag
gtaagatgaaagaatgacagaaacagaatgagggattttcaggcaagggg
gaaggaagggcagttttggtgaaaggactgtagctgactggtggggggct
ggctttggaaatactttgaggggatcctgagactggactctagactctcc
cctggttgttcccttccccgagttctggccggttcttggaccagacaagg
catggcccaagaaggtagatcagaattttttagcctttttttcattagtg
ccttccctagtataattccagattttttttcttaatcacatgaaatttta
ataccacagatatactatacatctgtttatgttctgtatatgttctgtgc
tttatacgtaaaaaagagtaagattttttttcacctccccttttaagaat
cagttttaattcccttgagaatgcttgttatagattgaaggctggtaagg
ggttgggctcctctttcttcttcctggtgccagagtgctcccacatgaag
gaataggaaaggaagatgcaaagagggaaatccttcgaacacatgaagac
acaggaagaggcctcttagggctccaagggctccagggaagcagctgcag
aggttgggtggggtgaggggccaggatccactgaccctggggccaggcag
gaatcactctgttgcctggggctcagaaggcagtatcacccatggttcct
gtcattgctcatgtattttgcctttcaacaattattgtgcacctactgtg
tgcaggccctgcctggacactggggatgcgcagtggatgcactgggctct
gcctttgagggttgcagtttaatgggtgacaggtaattataaggaagaag
gtgagtgcagagtgggaggcttggaggctgtggggcttggggtgggggag
ctcacatccagcctctgggccaaggccaggaggcttcccagagcaggaga
cagagcagggtattgtggtggggggtgtcctttttggggctgggatctgc
actttacagtttgaggggatgggcagaggaggctgggcttcattctggag
gtggggacatggtgaggtgaggtttagaaagcacacctgagccgcagtgt
gtaggatgctggaaatggtggagatgggcctgcgaagagagtgctgggaa
gtgatgacccaggagcagcagccgggcacctaacaatgggtcagcaccgt
gggcgtggagacaaaggccgggattgatcaatacccgagaagtacaatgt
acaggacttgggctccatttggatggagtgggtgagggaggagtcagaaa
tggcttccgatttccagcttgggcctggggattggagatgtccccactga
gagtagggcacaagtgaggaaatggtttggagaggaagatgataagttac
atcatggatgtgctgagtctgagttgcctatgggacttggaatggggggt
ggcaaaaggtgtgtgatcttgagcaagatattcaactcttctgggccttg
gtcttctcatttgtaaaacggtgataagaatattacttcccatttgtgtt
gctgtgaatattaaatgcgctaccacatgt

Thank you for taking the time to go through my problem.

Any help and input would be deeply appreciated.


1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 7:57 pm

    This is pretty much the same as your previous problem except that the intervals are independent of the length of the sequence and so can be defined just once instead of changing them for every sequence.

    This program is a modification of my previous solution. As I described, it starts with a fixed set of values in @offsets from 100 to 1000 in steps of 100, and the final range > 1000 is terminated at 2E9 or 2 billion. This is close to the maximum positive 32-bit integer and serves to catch all offsets above 1000. I assume you won’t be dealing with sequences any bigger than this?

    The @totals and @counts arrays are initialised to zeroes with the same number of elements as the @offsets array.

    Otherwise the functionality is much as before.

    use strict;
    use warnings;
    
    use List::MoreUtils 'firstval';
    
    my $gpat = '[G]{3,5}';
    my $npat = '[A-Z]{1,25}';
    my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
    my $regex = qr/$pattern/i;
    
    open my $fh, '<', 'small.fa' or die $!;
    
    my @offsets = map $_*100, 1 .. 10;
    push @offsets, 2E9;
    my @totals = (0) x @offsets;
    
    my ($id, $seq);
    
    while (<$fh>) {
    
      chomp;
    
      if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
      }
      elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
      }
    }
    
    print "Total: @totals\n";
    
    
    
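    sub process_seq {
    
      my $sequence = shift;
    
      # Per-sequence counts, one slot per interval in @offsets
      my @counts = (0) x @offsets;
    
      while ($sequence =~ /$regex/g) {
        my $place = $-[0];    # start offset of the current match
        # Index of the first interval whose upper bound exceeds $place
        my $i = firstval { $place < $offsets[$_] } keys @offsets;
        $counts[$i]++;
      }
    
      # Add this sequence's counts to the grand totals once, after the loop
      print "Counts: @counts\n\n";
      $totals[$_] += $counts[$_] for keys @totals;
    }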
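
The cumulative total reported in the question (5+4+3+2+1 rather than 5) appears to come from where the totals are updated. In the question's branch for sequences of 1000 characters or more, the running counts are added to the totals inside the `while` loop, once per match, instead of once per sequence after the loop. A minimal sketch of the difference, using a made-up list of five match offsets in place of a real sequence (`$inside` and `$outside` are illustrative names only):

```perl
use strict;
use warnings;

# Five hypothetical match offsets, all beyond 1000.
my @places = (1010, 1200, 1500, 1800, 2100);

my ($inside, $outside) = (0, 0);
my $count = 0;

for my $place (@places) {
    $count++;
    $inside += $count;    # adds the running count on every match: 1+2+3+4+5
}
$outside += $count;       # adds the finished count once per sequence

print "inside loop: $inside, after loop: $outside\n";
```

With five matches the first variant yields 15 and the second yields 5, which matches the pattern described in the question.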
    

    output

    Running this program against your new data file small.fa produces

    Total: 1 1 0 0 0 0 0 1 0 1 10
    

    But the result using the data from the previous question, sample.fa, is much more interesting

    Total: 5 4 1 0 0 2 2 1 0 0 1
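
The eleven numbers in each Total line follow the order of @offsets: ten 100-character intervals, then everything from offset 1000 upwards. A small sketch for printing them with labels, assuming the small.fa totals shown above as sample input (the `label_totals` helper is hypothetical and not part of the program above):

```perl
use strict;
use warnings;

# Same interval boundaries as the program above.
my @offsets = map { $_ * 100 } 1 .. 10;
push @offsets, 2E9;

# Hypothetical helper: pair each total with a readable interval label.
sub label_totals {
    my @totals = @_;
    my @lines;
    my $start = 0;
    for my $i (0 .. $#offsets) {
        # The last slot is the open-ended catch-all interval.
        my $label = $i == $#offsets ? ">=$start" : "$start-" . ($offsets[$i] - 1);
        push @lines, sprintf '%-9s %d', $label, $totals[$i];
        $start = $offsets[$i];
    }
    return @lines;
}

# Totals reported above for small.fa.
my @report = label_totals(1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 10);
print "$_\n" for @report;
```

The labels use inclusive bounds (0-99, 100-199, and so on) because the program assigns a match to the first interval whose upper bound is strictly greater than its start offset.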
    