I have two tab-delimited genome sequence files (SAM format), and I would like to

Question

Asked: May 26, 2026 (2026-05-26T21:29:58+00:00)

I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (each of which occupies a single line) are present in each file.

Here is an example of input file format:

HWI-AT555:86:D0:6:2208:13551:55125       122     chr1    77028   255     94M555N7M       *       0       0       GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG   &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&&   NM:i:3  XS:A:+    NH:i:1

The important part is the sequence read ID, which is the first column (i.e., HWI-….55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.

Here is what I have so far:

unless (@ARGV == 2) {
    print "Use as follows: perl program.pl in1.file in2.file\n";
    die;
}

my $in1 = $ARGV[0];
my $in2 = $ARGV[1];

open ONE, $in1;
open TWO, $in2;

my %hash1;
my @hit;    

while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    $hash1{$hit[0]}=1;
}
close ONE;

my @col;

while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    if ($col[0] =~ /^H/){  #only valid sequence read lines start with "H"
        print "$col[0]\n" if defined($hash1{$_});   

    }
}
close TWO;

So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.

I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn’t make something clear enough.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:29:59+00:00

Initialize your %hash1 with zeros instead of ones:

while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    # Start them as "0" for "no duplicates".
    $hash1{$hit[0]} = 0;
}

Then, in your second loop, you can increment $hash1{$col[0]}:

while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    # Increment the counter if %hash1 has what we're looking for.
    ++$hash1{$col[0]} if(exists($hash1{$col[0]}));
}

Checking $col[0] =~ /^H/ in the second loop is unnecessary as long as %hash1 contains only valid read IDs, because the exists check on the hash already rejects everything else. Note that your first loop stores the first field of every line, so if the first file contains SAM header lines (which start with @), skip them there. You also need to look up $hash1{$col[0]} rather than $hash1{$_}: the first loop stores only the first field of each line as the key, whereas $_ holds the whole line. Finally, if you only use the first field of each line, the chomp calls are not needed, but they do no harm, so you can keep them if you want.
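To make the $hash1{$_} versus $hash1{$col[0]} point concrete, here is a minimal sketch. The line and the read ID READ_1 are made up for illustration and are not taken from your files; %seen stands in for %hash1.

```perl
use strict;
use warnings;

# Hypothetical, shortened SAM-like line (tab-separated); not real data.
my $line = "READ_1\t122\tchr1\t77028\n";

# Stand-in for %hash1: keyed by read ID only.
my %seen = ( 'READ_1' => 0 );

my @col = split /\t/, $line;

# The whole line is not a key in the hash, but its first field is.
my $whole_line_found  = exists $seen{$line}   ? 1 : 0;
my $first_field_found = exists $seen{$col[0]} ? 1 : 0;

print "whole line: $whole_line_found, first field: $first_field_found\n";
```

This is why the original print statement never fired: the lookup key has to be built the same way in both loops.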

This leaves you with all the repeated entries in %hash1 as entries with non-zero values, and you can grep those out:

my @dups = grep { $hash1{$_} > 0 } keys %hash1;

And then display them with their counts:

for my $k (sort @dups) {
    print "$k\t$hash1{$k}\n";
}

You could also check the counts while displaying the matches:

for my $k (sort keys %hash1) {
    print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
}
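Putting the pieces together, the following is one possible sketch of the whole job, including the total number of matches that the question asks for. It is not the only way to do it. The sub name count_matches, the in-memory demo data, and the read IDs READ_A to READ_D are all hypothetical. It also assumes that lines starting with @ are SAM header lines to be skipped, and it uses lexical filehandles with error checks instead of the bareword ONE and TWO handles.

```perl
use strict;
use warnings;

# Count how often each read ID from the first input occurs in the second.
# Takes two open filehandles; returns a hash ref of read ID => count.
sub count_matches {
    my ($fh1, $fh2) = @_;
    my %count;

    while (my $line = <$fh1>) {
        chomp $line;
        # Assumption: skip SAM header lines (start with @) and blank lines.
        next if $line =~ /^\@/ || $line !~ /\S/;
        my ($id) = split /\t/, $line, 2;   # only the first field is needed
        $count{$id} = 0;
    }

    while (my $line = <$fh2>) {
        chomp $line;
        next if $line =~ /^\@/ || $line !~ /\S/;
        my ($id) = split /\t/, $line, 2;
        $count{$id}++ if exists $count{$id};
    }

    return \%count;
}

# Demo with in-memory filehandles and made-up read IDs (hypothetical data).
my $data1 = "READ_A\t0\tchr1\nREAD_B\t0\tchr1\nREAD_C\t0\tchr2\n";
my $data2 = "READ_A\t0\tchr1\nREAD_A\t16\tchr3\nREAD_C\t0\tchr2\nREAD_D\t0\tchr4\n";
open my $fh1, '<', \$data1 or die "Cannot open in-memory file 1: $!";
open my $fh2, '<', \$data2 or die "Cannot open in-memory file 2: $!";

my $count = count_matches($fh1, $fh2);
close $fh1;
close $fh2;

# Print each matched ID with its count, then the overall total.
my $total = 0;
for my $id (sort keys %$count) {
    next unless $count->{$id} > 0;
    print "$id\t$count->{$id}\n";
    $total += $count->{$id};
}
print "Total matches: $total\n";
```

To run it on real files, replace the two in-memory opens with something like open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!"; and the same for the second file.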
