I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (each of which occupies a single line) are present in each file.
Here is an example of input file format:
HWI-AT555:86:D0:6:2208:13551:55125 122 chr1 77028 255 94M555N7M * 0 0 GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&& NM:i:3 XS:A:+ NH:i:1
The important part is the sequence read ID, which is the first column (i.e. HWI-….55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.
Here is what I have so far:
unless (@ARGV == 2) {
    print "Use as follows: perl program.pl in1.file in2.file\n";
    die;
}
my $in1 = $ARGV[0];
my $in2 = $ARGV[1];
open ONE, $in1;
open TWO, $in2;
my %hash1;
my @hit;
while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    $hash1{$hit[0]}=1;
}
close ONE;
my @col;
while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    if ($col[0] =~ /^H/){ #only valid sequence read lines start with "H"
        print "$col[0]\n" if defined($hash1{$_});
    }
}
close TWO;
So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.
I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn’t make something clear enough.
Initialize your %hash1 with zeros instead of ones, i.e. use $hash1{$hit[0]} = 0; in the first loop.

Then, in your second loop, you can increment $hash1{$col[0]} whenever that key already exists in the hash.

There’s no need to check $col[0] =~ /^H/ in the second loop as long as %hash1 only has entries for valid sequences (if the first file can contain other lines, apply that check in the first loop instead), so you can just do an exists check on the hash. And you want to look at $hash1{$col[0]} rather than $hash1{$_}: you’re only storing the first field of each line in your first loop, whereas $_ holds the whole line. Furthermore, if you’re just grabbing the first field of each line you don’t need the chomp calls, but they do no harm so you can keep them if you want.

This leaves you with all the repeated entries in %hash1 as entries with non-zero values; you can grep those out and then display them with their counts.

You could also check the counts while displaying the matches, which avoids the separate grep step.
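Putting those steps together, here is a minimal sketch rather than a definitive implementation. The subroutine names load_ids and count_matches and the short sample records are made up for illustration, and in-memory filehandles stand in for the two SAM files so that the example runs on its own. It keeps the rule from the question that valid read lines start with "H".

```perl
use strict;
use warnings;

# Record every read ID (first tab-separated column) from the first file,
# starting each count at zero.
sub load_ids {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\t/, $line;
        # Rule taken from the question: valid read lines start with "H".
        next unless defined $id && $id =~ /^H/;
        $count{$id} = 0;
    }
    return \%count;
}

# Walk the second file and increment the count for each ID that was
# already seen in the first file. Returns the total number of matches.
sub count_matches {
    my ($fh, $count) = @_;
    my $total = 0;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\t/, $line;
        next unless defined $id && exists $count->{$id};
        $count->{$id}++;
        $total++;
    }
    return $total;
}

# Hypothetical sample data: only the first column matters here.
my $file1 = "HWI-1\t0\tchr1\nHWI-2\t0\tchr1\nHWI-3\t16\tchr2\n";
my $file2 = "HWI-2\t0\tchr1\nHWI-2\t16\tchr3\nHWI-3\t0\tchr2\nHWI-9\t0\tchr1\n";

open my $one, '<', \$file1 or die "Cannot open first sample: $!";
open my $two, '<', \$file2 or die "Cannot open second sample: $!";

my $count = load_ids($one);
my $total = count_matches($two, $count);

close $one;
close $two;

# Keep only the IDs that matched at least once, then print them with counts.
my @matched = grep { $count->{$_} > 0 } keys %$count;
for my $id (sort @matched) {
    print "$id\t$count->{$id}\n";
}
print "Total matches: $total\n";
```

With the sample data this prints HWI-2 with a count of 2, HWI-3 with a count of 1, and a total of 3. For real data you would open the two file names from @ARGV instead of the in-memory strings.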