I have this two strings of equal length, which I need to compare.
I want to find overlap base(.) and internal gap (*). Below is the example:
------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
................**.................
Number of overlap = 33.
Number of internal gap = 2.
I have no problem finding the number of overlap. But I have problem
finding internal gap. Below is the current code I have. It is horribly slow.
In principle I need to compute millions of such pairs.
#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";
print "$s1\n";
print "$s2\n";
my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);
my $ovlp_basecount = 0;
my $internal_gap = 0;
foreach my $si ( 0 .. length($s1) ) {
my $base1 = substr($s1,$si,1);
my $base2 = substr($s2,$si,1);
# Overlap
if ( $base{$base1} && $base{$base2} ) {
$ovlp_basecount++;
}
# Not sure how to compute internal gap
}
print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";
Please advice how can I find internal gap and overlap efficiently.
You can use a bitwise OR on the strings to find the the areas in one string that overlap blank areas in the other. This process also has the effect of revealing the overlap by converting non-overlapping characters to lower case, thus making finding the overlap quite simple too:
Prints:
For more information on string bitwise operations:
http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt