I’m trying to compare two strings and as output I would like a count

Asked: June 8, 2026 (2026-06-08T03:07:26+00:00)

I’m trying to compare two strings position by position. As output I would like a count for each run of consecutive identical characters and, wherever the characters differ, just the character from the second string. I have a working recursive implementation, but I can’t figure out how to add consecutive counts together.

Code:

use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;
    my $div = $len / 2;
    if ( $div < 1 ) {
        return $b;
    }
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
         push @rtn, length $a_1;
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, length $a_2;
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }
    return @rtn;
}

print Data::Dumper::Dumper( [ find_diff('xaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff('aaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff( $str1, $str2 ) ] ) . "\n";

Output:

['a',2,1,1,'c']
[3,1,1,'c']
[26,3,1,1,'x','x','x','x','x','x','x',1,1,'x','x',4,7]

Desired Output:

['a',4,'c']
[5,'c']
[31,'x','x','x','x','x','x','x',2,'x','x',11]

Of course I can split the characters into an array with unpack and then count consecutive matches fairly easily, but I want to try a divide-and-conquer approach so I can compare performance.
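
For reference, the unpack-based approach mentioned above might look roughly like the sketch below. The name diff_iterative is made up for illustration, this is not the code that was benchmarked, and it assumes both strings have the same length.

```perl
use strict;
use warnings;

# Hypothetical helper: walk both strings once and emit either a run
# length (matching characters) or the differing character from $new.
sub diff_iterative {
    my ( $old, $new ) = @_;
    my @old = unpack 'C*', $old;    # character codes of the first string
    my @new = unpack 'C*', $new;    # character codes of the second string
    my @out;
    my $run = 0;                    # length of the current matching run
    for my $i ( 0 .. $#new ) {
        if ( defined $old[$i] && $old[$i] == $new[$i] ) {
            $run++;
        }
        else {
            push @out, $run if $run;    # flush the pending run, if any
            $run = 0;
            push @out, chr $new[$i];    # keep the char from the second string
        }
    }
    push @out, $run if $run;            # flush a trailing run
    return @out;
}

print join( ',', diff_iterative( 'xaabbb', 'aaabbc' ) ), "\n";    # a,4,c
```

Because the run counter lives across the whole loop, adjacent matches are summed as they are found, which is exactly the step the divide-and-conquer version has to do separately.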

Thanks!

Edit: I managed to solve it in the recursive case by returning a nested array and then reducing it. It’s surprisingly not that slow:

sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;
    if ( $len < 2 ) {
        return [$b, 0];
    }
    my $div = $len / 2;
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
        push @rtn, [length $a_1, 1];
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, [length $a_2, 1];
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }
    return @rtn;
}
sub compress_string {
    my ($a, $b) = @_;
    my @list = find_diff($a, $b);
    my $acc = 0;
    my @result = ();
    foreach my $item (@list) {
        if ( $item->[1] ) {
            $acc += $item->[0];
        } else {
            push @result, $acc if $acc;
            push @result, $item->[0];
            $acc = 0;
        }
    }
    push @result, $acc if $acc;
    return @result;
}

Results match what I want.
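
To make the reduction step concrete, here is a minimal standalone sketch of it, fed with the [value, is_count] pairs that the recursive find_diff produces for 'xaabbb' versus 'aaabbc'. The name merge_runs is hypothetical; the logic is meant to mirror the loop in compress_string.

```perl
use strict;
use warnings;

# Hypothetical helper: merge adjacent count pairs into one number and
# pass literal characters through unchanged.
sub merge_runs {
    my @pairs = @_;
    my @result;
    my $acc = 0;    # running total of adjacent counts
    for my $pair (@pairs) {
        my ( $value, $is_count ) = @$pair;
        if ($is_count) {
            $acc += $value;
            next;
        }
        push @result, $acc if $acc;    # flush the total before a literal
        push @result, $value;
        $acc = 0;
    }
    push @result, $acc if $acc;        # flush a trailing total
    return @result;
}

# Pairs for find_diff('xaabbb', 'aaabbc'): literal 'a', counts 2, 1, 1, literal 'c'
my @pairs = ( [ 'a', 0 ], [ 2, 1 ], [ 1, 1 ], [ 1, 1 ], [ 'c', 0 ] );
print join( ',', merge_runs(@pairs) ), "\n";    # a,4,c
```

The flag in the second slot is what lets the reducer tell a count of 2 apart from a literal character '2'.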

Update – Performance Stats

This is really interesting. Using unpack( 'C*', $string) is extremely fast, and I think that’s why my iterative version is so speedy. The speed advantage of the recursive version only shows up with the longer string (434 chars).

                         Rate short_recurse_borodin short_recurse short_array_borodin short_array_sodved short_array
short_recurse_borodin  6944/s                    --          -31%                -36%               -73%        -84%
short_recurse         10091/s                   45%            --                 -8%               -61%        -76%
short_array_borodin   10929/s                   57%            8%                  --               -57%        -74%
short_array_sodved    25707/s                  270%          155%                135%                 --        -40%
short_array           42553/s                  513%          322%                289%                66%          --

                      Rate mid_array_borodin mid_recurse_borodin mid_recurse mid_array_sodved mid_array
mid_array_borodin   1418/s                --                -28%        -56%             -65%      -82%
mid_recurse_borodin 1972/s               39%                  --        -39%             -52%      -76%
mid_recurse         3226/s              127%                 64%          --             -21%      -60%
mid_array_sodved    4082/s              188%                107%         27%               --      -49%
mid_array           8065/s              469%                309%        150%              98%        --

                       Rate long_array_borodin long_array_sodved long_recurse_borodin long_array long_recurse
long_array_borodin    172/s                 --              -67%                 -80%       -85%         -89%
long_array_sodved     513/s               199%                --                 -40%       -55%         -67%
long_recurse_borodin  854/s               397%               66%                   --       -25%         -45%
long_array           1142/s               564%              122%                  34%         --         -26%
long_recurse         1546/s               800%              201%                  81%        35%           --

1 Answer

Editorial Team · Answer 1 · 2026-06-08T03:07:28+00:00

Thanks to Borodin and Sodved, I have improved my solution to the point that it’s pretty fast. Since the strings I am comparing are log messages that are almost identical apart from a few changing values, the recursive solution skips a huge amount of work: any half that compares equal is counted in one step.

As Sodved mentioned, there would not be a similar gain in C since I would still have to do a character-by-character comparison.

It now checks whether the length of the string is below a certain threshold and, if so, falls back on the array comparison.

Performance looks like this:

                        Rate          long_recurse long_recurse_fallback
long_recurse          1613/s                    --                  -18%
long_recurse_fallback 1961/s                   22%                    --

Here is my final code (with the long test strings removed, since they’re real log messages):

use strict;
use warnings;
use Data::Dumper;
use Benchmark qw(cmpthese);
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;

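my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

# Placeholders only: the real log messages were removed from this post.
# Any two strings of equal length can be substituted here.
my $long_a = $str1 x 8;
my $long_b = $str2 x 8;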

sub find_diff {
    my ( $a, $b, $minlen ) = @_;
    my $len = length $a;
    if ($len < $minlen) {
        return compress_unpack_ary( $a, $b );
    }
    if ( $len < 2 ) {
        return [ord($b), 0];
    }
    my @rtn = ();
    my $div = $len / 2;
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
        push @rtn, [length $a_1, 1];
    }
    else {
        push @rtn, find_diff( $a_1, $b_1, $minlen );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, [length $a_2, 1];
    }
    else {
        push @rtn, find_diff( $a_2, $b_2, $minlen );
    }
    return @rtn;
}

sub compress_string {
    my ($a, $b, $minlen) = @_;
    my @list = find_diff($a, $b, $minlen);
    my $acc = 0;
    my @result = ();
    foreach my $item (@list) {
        if ( $item->[1] ) {
            $acc += $item->[0];
        } else {
            while ( $acc > 127 ) {
                push @result, 255;
                $acc -= 127;
            }
            push @result, $acc + 128 if $acc;
            push @result, $item->[0];
            $acc = 0;
        }
    }
    while ( $acc > 127 ) {
        push @result, 255;
        $acc -= 127;
    }
    push @result, $acc + 128 if $acc;
    return pack('C*', @result);
}
sub compress_unpack_ary {
    my ( $a, $b ) = @_;
    my @orig       = unpack('C*', $a);
    my @new        = unpack('C*', $b);
    my @nonmatches = ();
    my $count      = 0;
    my $repeats    = 0;
    while ( $count < scalar @new ) {
        if ( $orig[$count] and $new[$count] == $orig[$count] ) {
            $repeats++;
        }
        elsif ( $repeats == 1 ) {
            push @nonmatches, [ $new[$count - 1], 0], [$new[$count], 0];
            $repeats = 0;
        }
        elsif ( $repeats > 1 ) {
            push @nonmatches, [$repeats, 1];
            $repeats = 0;    # reset counter
            push @nonmatches, [$new[$count], 0];
        }
        else {
            push @nonmatches, [$new[$count], 0];
        }
        $count++;
    }
    if ( $repeats > 0 ) {
        push @nonmatches, [$repeats, 1];
    }
    return @nonmatches;
}
print Data::Dumper::Dumper( [ compress_string( $str1, $str2, 20 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $str1, $str2, 0 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $long_a, $long_b, 20 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $long_a, $long_b, 0 ) ] ) . "\n";

cmpthese(1000, {
        'long_recurse' => sub { compress_string($long_a, $long_b, 0 ) },
        'long_recurse_fallback' => sub { compress_string($long_a, $long_b, 20 ) },
        });
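
The packed output can be read back as follows: a byte above 128 means "copy (byte - 128) characters from the first string", with 255 standing for a full run of 127, and any other byte is a literal character from the second string. A minimal decoder sketch is below. The name expand_string is hypothetical, and it assumes every literal character has a code below 128 (plain ASCII), since the format cannot otherwise tell literals from counts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): rebuild the second
# string from the first string plus the packed diff.
sub expand_string {
    my ( $orig, $packed ) = @_;
    my $pos = 0;     # current position in the original string
    my $out = '';
    for my $byte ( unpack 'C*', $packed ) {
        if ( $byte > 128 ) {
            my $run = $byte - 128;               # number of matching characters
            $out .= substr $orig, $pos, $run;    # copy them from the original
            $pos += $run;
        }
        else {
            $out .= chr $byte;                   # literal from the second string
            $pos++;
        }
    }
    return $out;
}

# 'aaabbb' vs 'aaabbc': five matching characters, then a literal 'c'
my $packed = pack 'C*', 5 + 128, ord 'c';
print expand_string( 'aaabbb', $packed ), "\n";    # aaabbc
```

If the log messages can contain bytes of 128 or above, the format would need an escape mechanism before a decoder like this could be trusted.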
