I am trying to perform some composition-based filtering on a large collection of strings (protein sequences).
I wrote a group of three subroutines in order to take care of it, but I’m running into trouble in two ways – one minor, one major. The minor trouble is that when I use List::MoreUtils ‘pairwise’ I get warnings about using $a and $b only once and them being uninitialized. But I believe I’m calling this method properly (based on CPAN’s entry for it and some examples from the web).
The major trouble is an error "Can't use string ("17/32") as HASH ref while "strict refs" in use..."
It seems like this can only happen if the foreach loop in &comp is giving the hash values as a string instead of evaluating the division operation. I’m sure I’ve made a rookie mistake, but can’t find the answer on the web. The first time I even looked at perl code was last Wednesday…
use List::Util;
use List::MoreUtils;
my @alphabet = (
    'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
    'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
);
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
    my %counts = ();
    foreach my $chr (@alphabet) {
        $counts{$chr} = ( $_[0] =~ tr/$chr/$chr/ );
    }
    $counts{'gap'} = ( $_[0] =~ tr/$gapchr/$gapchr/ );
    return %counts;
}
# Takes a sequence and returns letter => fractional composition pairs as a hash.
sub comp {
    my %comp = getcounts( $_[0] );
    foreach my $chr (@alphabet) {
        $comp{$chr} = $comp{$chr} / ( length( $_[0] ) - $comp{'gap'} );
    }
    return %comp;
}
# Takes two sequences and returns a measure of the composition difference between them, as a scalar.
# Originally all on one line but it was unreadable.
sub dcomp {
    my @dcomp = pairwise { $a - $b } @{ values( %{ comp( $_[0] ) } ) }, @{ values( %{ comp( $_[1] ) } ) };
    @dcomp = apply { $_ ** 2 } @dcomp;
    my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
    return $dcomp;
}
Much appreciation for any answers or advice!
%{ $foo } will treat $foo as a hash reference and dereference it; similarly, @{} will dereference array references. Since comp returns a hash as a list (hashes become lists when passed to and from functions) and not a hash reference, the %{} is wrong. You could potentially leave off the %{}, but values is a special form and needs a hash, not a hash passed as a list. To pass the result of comp to values, comp needs to return a hash ref that then gets dereferenced.
There’s another problem with your dcomp: as the documentation says, the results of values “are returned in an apparently random order”, so the values passed to the pairwise block aren’t necessarily for the same character. Instead of values, you can use hash slices. We’re now back to comp returning a hash (as a list).
This doesn’t address what happens if a character appears in only one of $_[0] and $_[1]. uniq left as an exercise for the reader.
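As a rough illustration of the hash-slice approach described above, here is a minimal sketch rather than a definitive fix. It keeps comp returning a hash (as a list) and reads both hashes through @alphabet, so the two value lists line up letter by letter. A few things in it are my own assumptions and not part of the original code: the per-character counting loop in getcounts (tr/// does not interpolate variables, so tr/$chr/$chr/ would not count the letter held in $chr), the die on a sequence with no non-gap characters, and the index-based map that stands in for pairwise and apply so that only the core List::Util module is needed.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @alphabet = qw(A R N D C Q E G H I L K M F P S T W Y V);
my $gapchr   = '-';

# Takes a sequence and returns letter => occurrence count pairs as a hash.
# Assumption: counts are taken with a plain loop, because tr/// does not
# interpolate $chr or $gapchr.
sub getcounts {
    my ($seq) = @_;
    my %counts = map { $_ => 0 } @alphabet;
    $counts{gap} = 0;
    for my $chr ( split //, $seq ) {
        if    ( $chr eq $gapchr )      { $counts{gap}++ }
        elsif ( exists $counts{$chr} ) { $counts{$chr}++ }
    }
    return %counts;
}

# Takes a sequence and returns letter => fractional composition pairs
# as a hash (flattened to a list on return).
sub comp {
    my ($seq) = @_;
    my %comp     = getcounts($seq);
    my $residues = length($seq) - $comp{gap};
    # Assumption: an all-gap or empty sequence is treated as an error.
    die "sequence has no non-gap characters\n" unless $residues > 0;
    $comp{$_} /= $residues for @alphabet;
    return %comp;
}

# Takes two sequences and returns a scalar composition difference.
sub dcomp {
    my %c1 = comp( $_[0] );
    my %c2 = comp( $_[1] );
    # Hash slices return values in @alphabet order for both hashes.
    my @p  = @c1{@alphabet};
    my @q  = @c2{@alphabet};
    my @sq = map { ( $p[$_] - $q[$_] ) ** 2 } 0 .. $#alphabet;
    return sqrt( sum( 0, @sq ) ) / 20;
}

printf "%.4f\n", dcomp( 'ARND-', 'ARNC' );
```

Because getcounts in this sketch initializes every letter of @alphabet to zero, a letter that appears in only one of the two sequences still has an entry in both hashes, so the slices never yield undef. If you keep pairwise instead, it needs real arrays (such as @p and @q here) and an explicit import, e.g. use List::MoreUtils qw(pairwise).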