The Archive Base

Editorial Team
Asked: May 26, 2026 (2026-05-26T21:29:58+00:00)

I have two tab-delimited genome sequence files (SAM format), and I would like to

I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (which comprise a single line) are present in each.

Here is an example of input file format:

HWI-AT555:86:D0:6:2208:13551:55125       122     chr1    77028   255     94M555N7M       *       0       0       GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG   &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&&   NM:i:3  XS:A:+    NH:i:1

The important part is the sequence read id, which is the first column (i.e. HWI-….55125). This is what I want to use to compare the two files, so that I can count the number of duplicates/copies.

Here is what I have so far:

unless (@ARGV == 2) {
    print "Use as follows: perl program.pl in1.file in2.file\n";
    die;
}

my $in1 = $ARGV[0];
my $in2 = $ARGV[1];

open ONE, $in1;
open TWO, $in2;

my %hash1;
my @hit;    

while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    $hash1{$hit[0]}=1;
}
close ONE;

my @col;

while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    if ($col[0] =~ /^H/){  #only valid sequence read lines start with "H"
        print "$col[0]\n" if defined($hash1{$_});   

    }
}
close TWO;

So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.

I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn’t make something clear enough.

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 9:29 pm

    Initialize your %hash1 with zeros instead of ones:

    while (<ONE>){
        chomp;
        my @hit = split(/\t/, $_);
        # Start them as "0" for "no duplicates".
        $hash1{$hit[0]} = 0;
    }
    

    Then, in your second loop, you can increment $hash1{$col[0]}:

    while (<TWO>){
        chomp;
        my @col = split(/\t/, $_);
        # Increment the counter if %hash1 has what we're looking for.
        ++$hash1{$col[0]} if(exists($hash1{$col[0]}));
    }
    

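    There’s no need to check $col[0] =~ /^H/ as long as %hash1 only holds valid read ids, because the exists check then filters out everything else. This holds only if the first file contains nothing but read lines: your first loop stores the first field of every line, so any SAM header lines (which start with @) would end up as keys too. You also want to look at $hash1{$col[0]} rather than $hash1{$_}: the first loop stores only the first field of each line as the key, whereas $_ holds the whole line. Finally, if you’re only grabbing the first field of each line, the chomp calls are not strictly needed (unless a line contains no tab at all), but they do no harm, so you can keep them if you want.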
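To see why the lookup key matters, here is a small sketch. It is not part of the original code, and the two-column line and the read_1 id are made up for illustration. It looks up the same line once by $_ and once by its first field:

```perl
use strict;
use warnings;

# Hypothetical two-column line standing in for a full SAM record.
my $line = "read_1\t122\n";

# The hash is keyed by read id only, as in the first loop above.
my %seen = (read_1 => 0);

$_ = $line;
chomp;
my @col = split /\t/, $_;

# $_ still holds "read_1<TAB>122", which was never stored as a key.
my $whole_line_hit  = exists $seen{$_}      ? 1 : 0;
# $col[0] is "read_1", which matches the stored key.
my $first_field_hit = exists $seen{$col[0]} ? 1 : 0;

print "whole line: $whole_line_hit, first field: $first_field_hit\n";
```

Only the first-field lookup finds the entry, which is why the original print statement never fired.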

    This leaves you with all the repeated entries in %hash1 as entries with non-zero values, and you can grep those out:

    my @dups = grep { $hash1{$_} > 0 } keys %hash1;
    

    And then display them with their counts:

    for my $k (sort @dups) {
        print "$k\t$hash1{$k}\n";
    }
    

    You could also check the counts while displaying the matches:

    for my $k (sort keys %hash1) {
        print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
    }
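Putting the pieces together, the whole program might look roughly like the sketch below. The names count_matches and report are hypothetical helpers, not part of the original code. Skipping lines that start with @ assumes those are SAM header lines you do not want counted. The running total is an addition to cover the "total number of matches" part of the question.

```perl
use strict;
use warnings;

# For every read id in the first file, count how many lines of the
# second file carry the same id. Takes two open filehandles and
# returns a hash reference of id => count.
sub count_matches {
    my ($fh_one, $fh_two) = @_;
    my %count;

    while (my $line = <$fh_one>) {
        next if $line =~ /^\@/;      # assumption: skip SAM header lines
        chomp $line;
        my ($id) = split /\t/, $line;
        $count{$id} = 0 if defined $id && length $id;
    }

    while (my $line = <$fh_two>) {
        next if $line =~ /^\@/;
        chomp $line;
        my ($id) = split /\t/, $line;
        $count{$id}++ if defined $id && exists $count{$id};
    }

    return \%count;
}

# Print "id<TAB>count" for every id seen in both files, then the total.
sub report {
    my ($count) = @_;
    my $total = 0;
    for my $id (sort grep { $count->{$_} > 0 } keys %$count) {
        print "$id\t$count->{$id}\n";
        $total += $count->{$id};
    }
    print "Total matches: $total\n";
    return $total;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    open my $one, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $two, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    report(count_matches($one, $two));
    close $one;
    close $two;
}
```

Run it as perl program.pl in1.file in2.file, as in the question. Lexical filehandles with three-argument open and an "or die" check are used here so that a mistyped file name fails loudly instead of silently producing no output.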
    