I’m trying to compare two strings. As output I would like a count of consecutive identical characters and, wherever a character differs, just the character from the second string. I have a working recursive implementation, but I can’t figure out how to add consecutive counts together.
Code:
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;
    my $div = $len / 2;

    # A single character cannot be split further: return the char from $b.
    if ( $div < 1 ) {
        return $b;
    }

    # First half: emit its length if identical, otherwise recurse.
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ( $a_1 eq $b_1 ) {
        push @rtn, length $a_1;
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }

    # Second half: same treatment.
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ( $a_2 eq $b_2 ) {
        push @rtn, length $a_2;
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }

    return @rtn;
}

print Data::Dumper::Dumper( [ find_diff( 'xaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff( 'aaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff( $str1, $str2 ) ] ) . "\n";
Output:
['a',2,1,1,'c']
[3,1,1,'c']
[26,3,1,1,'x','x','x','x','x','x','x',1,1,'x','x',4,7]
Desired Output:
['a',4,'c']
[5,'c']
[31,'x','x','x','x','x','x','x',2,'x','x',11]
Of course I can split the characters into an array with unpack and then count consecutive matches fairly easily, but I want to try a divide-and-conquer approach so I can compare performance.
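For reference, the iterative approach mentioned above could look roughly like the sketch below. The name find_diff_iter is made up for illustration, and the code assumes both strings have the same length and contain only single-byte characters (for wider characters, the 'W*' template would be the likelier choice).

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original code: walk both strings one
# character at a time and count runs of matching positions.
# Assumes equal-length strings of single-byte characters.
sub find_diff_iter {
    my ( $s1, $s2 ) = @_;
    my @c1 = unpack 'C*', $s1;    # character codes of the first string
    my @c2 = unpack 'C*', $s2;    # character codes of the second string
    my @result;
    my $run = 0;                  # length of the current run of matches

    for my $i ( 0 .. $#c2 ) {
        if ( $c1[$i] == $c2[$i] ) {
            $run++;
        }
        else {
            push @result, $run if $run;    # flush the pending count
            push @result, chr $c2[$i];     # emit the differing character
            $run = 0;
        }
    }
    push @result, $run if $run;            # flush a trailing run
    return @result;
}

print join( ',', find_diff_iter( 'xaabbb', 'aaabbc' ) ), "\n";    # a,4,c
```

Because the counter is carried across the whole string, adjacent matches are merged as a side effect, which is exactly the part the recursive version struggles with.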
Thanks!
Edit: I managed to solve it in the recursive case by returning a nested array and then reducing. It’s surprisingly not that slow:
# Returns a list of [value, flag] pairs: flag 1 marks a count of matching
# characters, flag 0 marks a differing character taken from $b.
sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;

    if ( $len < 2 ) {
        return [ $b, 0 ];
    }

    my $div = $len / 2;

    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ( $a_1 eq $b_1 ) {
        push @rtn, [ length $a_1, 1 ];
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }

    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ( $a_2 eq $b_2 ) {
        push @rtn, [ length $a_2, 1 ];
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }

    return @rtn;
}

# Reduces the pairs from find_diff, summing adjacent counts.
sub compress_string {
    my ( $a, $b ) = @_;
    my @list   = find_diff( $a, $b );
    my $acc    = 0;
    my @result = ();

    foreach my $item (@list) {
        if ( $item->[1] ) {
            $acc += $item->[0];
        }
        else {
            # Flush the accumulated count before the differing character.
            push @result, $acc if $acc;
            push @result, $item->[0];
            $acc = 0;
        }
    }
    push @result, $acc if $acc;

    return @result;
}
Results match what I want.
Update – Performance Stats
This is really interesting. Using unpack( 'C*', $string ) is insanely fast, and I think that is why my iterative version is so speedy. The speed advantage of the recursive version shows up with the longer string (434 chars).
                         Rate short_recurse_borodin short_recurse short_array_borodin short_array_sodved short_array
short_recurse_borodin  6944/s                    --          -31%                -36%               -73%        -84%
short_recurse         10091/s                   45%            --                 -8%               -61%        -76%
short_array_borodin   10929/s                   57%            8%                  --               -57%        -74%
short_array_sodved    25707/s                  270%          155%                135%                 --        -40%
short_array           42553/s                  513%          322%                289%                66%          --

                      Rate mid_array_borodin mid_recurse_borodin mid_string mid_array_sodved mid_array
mid_array_borodin   1418/s                --                -28%       -56%             -65%      -82%
mid_recurse_borodin 1972/s               39%                  --       -39%             -52%      -76%
mid_recurse         3226/s              127%                 64%         --             -21%      -60%
mid_array_sodved    4082/s              188%                107%        27%               --      -49%
mid_array           8065/s              469%                309%       150%              98%        --

                       Rate long_array_borodin long_array_sodved long_recurse_borodin long_array long_string
long_array_borodin    172/s                 --              -67%                 -80%       -85%        -89%
long_array_sodved     513/s               199%                --                 -40%       -55%        -67%
long_recurse_borodin  854/s               397%               66%                   --       -25%        -45%
long_array           1142/s               564%              122%                  34%         --        -26%
long_recurse         1546/s               800%              201%                  81%        35%          --
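Comparison tables in this layout appear to be what cmpthese from the core Benchmark module prints. As a rough sketch of such a harness, the snippet below compares unpack 'C*' against split // for breaking a string into characters. The test string and the labels unpack_C and split are made up here, since the real test data is not shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical 434-character test string (7 * 62 = 434); the original
# test strings are not part of the post.
my $string = 'abcdefg' x 62;

# Two ways to break a string into per-character values.
sub chars_unpack { my @c = unpack 'C*', $_[0]; return scalar @c }
sub chars_split  { my @c = split //,    $_[0]; return scalar @c }

# Run each candidate for at least one CPU second, then print a rate table
# with pairwise percentage differences.
cmpthese( -1, {
    unpack_C => sub { chars_unpack($string) },
    split    => sub { chars_split($string) },
} );
```

The numbers depend heavily on the machine and Perl version, so treat any single run as indicative only.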
Thanks to Borodin and Sodved, I have improved my solution to the point that it’s pretty fast. Since the strings I am comparing are log messages that are almost identical apart from changing values, the recursive solution eliminates a huge amount of work: any pair of identical substrings is dismissed with a single string comparison.
As Sodved mentioned, there would not be a similar gain in C since I would still have to do a character-by-character comparison.
It now checks whether the length of the string is below a certain threshold and, if so, falls back on the array comparison.
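The threshold idea can be sketched as follows. This is only an illustration under assumptions, not the final code: the names diff_array, diff_hybrid and compress are hypothetical, the cutoff of 16 is an arbitrary guess, and both strings are assumed to be equal-length and single-byte.

```perl
use strict;
use warnings;

# Hypothetical cutoff (the real value is not stated); must be at least 2.
my $THRESHOLD = 16;

# Character-by-character comparison for short pieces.
# Returns [count, 1] for a run of matches and [char, 0] for a difference.
sub diff_array {
    my ( $s1, $s2 ) = @_;
    my @c1  = unpack 'C*', $s1;
    my @c2  = unpack 'C*', $s2;
    my @out;
    my $run = 0;
    for my $i ( 0 .. $#c2 ) {
        if ( $c1[$i] == $c2[$i] ) { $run++; next }
        push @out, [ $run, 1 ] if $run;
        push @out, [ chr $c2[$i], 0 ];
        $run = 0;
    }
    push @out, [ $run, 1 ] if $run;
    return @out;
}

# Divide and conquer: identical pieces cost one string comparison,
# pieces shorter than the threshold are handed to diff_array.
sub diff_hybrid {
    my ( $s1, $s2 ) = @_;
    return [ length($s1), 1 ] if $s1 eq $s2;
    return diff_array( $s1, $s2 ) if length($s1) < $THRESHOLD;
    my $div = int( length($s1) / 2 );
    return (
        diff_hybrid( substr( $s1, 0, $div ), substr( $s2, 0, $div ) ),
        diff_hybrid( substr( $s1, $div ),    substr( $s2, $div ) ),
    );
}

# Merge adjacent counts into the flat output format.
sub compress {
    my ( $s1, $s2 ) = @_;
    my @result;
    my $acc = 0;
    for my $item ( diff_hybrid( $s1, $s2 ) ) {
        if ( $item->[1] ) { $acc += $item->[0]; next }
        push @result, $acc if $acc;
        push @result, $item->[0];
        $acc = 0;
    }
    push @result, $acc if $acc;
    return @result;
}

print join( ',', compress( 'xaabbb', 'aaabbc' ) ), "\n";    # a,4,c
```

A sensible cutoff would have to be found by benchmarking; too low and the recursion overhead dominates, too high and long identical stretches are compared character by character.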
Performance looks like this:
Here is my final code (with the test strings removed, they’re real log messages):