For the Perl code below, I need to increase its efficiency since it’s taking hours to process the input files (which contain millions of lines of data). Any ideas on how I can speed things up?
Given two files, I want to compare the data and print those lines that match and those that don’t. Please note that two columns need to be compared interchangeably.
For example,
input1.txt
A B
C D
input2.txt
B A
C D
E F
G H
Please note:
Lines 1 and 2 match (interchangeably); Lines 3 and 4 don’t match
Output:
B A match
C D match
E F don't match
G H don't match
Perl code:
#!/usr/bin/perl -w
use strict;
use warnings;
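# Use low-precedence "or" so that die actually fires when open fails
open INFH1, "<input1.txt" or die "Cannot open input1.txt: $!\n";
open INFH2, "<input2.txt" or die "Cannot open input2.txt: $!\n";
chomp (my @array=<INFH2>);
while (<INFH1>)
{
    my @values = split;
    # Skip lines that are not exactly two all-digit fields
    next if grep /\D/, @values or @values != 2;
    my $re = qr/\A$values[0]\s+$values[1]\z|\A$values[1]\s+$values[0]\z/;
    # Linear scan of every input2.txt line for each input1.txt line
    foreach my $temp (@array)
    {
        chomp $_;
        print "$_\n" if grep $_ =~ $re, $temp;
    }
}
close INFH1;
close INFH2;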
1;
Any ideas on how to increase the efficiency of this code are highly appreciated. Thanks!
If you have enough memory, use a hash. If symbols do not occur multiple times in input1.txt (i.e. if "A B" is in the file, "A X" is not), the following should work:
Update:
For repeated values, I would use a hash of hashes. Sort the two symbols on each line: the first one becomes the key in the large hash, and the second one becomes the key in the subhash.
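A minimal sketch of the hash-of-hashes idea, under a few assumptions of mine: both files hold two whitespace-separated columns, the reference file fits in memory, and the sub names build_index and classify and the two command-line file arguments are hypothetical choices for illustration. Sorting each pair before storing or looking it up makes "A B" and "B A" map to the same keys, so every line of the second file costs one hash lookup instead of a scan over the whole first file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of hashes from the reference pairs. The two symbols are
# sorted (default string sort) so "A B" and "B A" share the same keys.
sub build_index {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        my @values = split ' ', $line;
        next unless @values == 2;    # ignore lines without exactly two fields
        my ($first, $second) = sort @values;
        $seen{$first}{$second} = 1;
    }
    return \%seen;
}

# Return the pair followed by "match" or "don't match";
# returns undef for lines without exactly two fields.
sub classify {
    my ($seen, $line) = @_;
    my @values = split ' ', $line;
    return unless @values == 2;
    my ($first, $second) = sort @values;
    # Check the outer key first so the lookup does not autovivify it
    my $found = exists($seen->{$first}) && exists($seen->{$first}{$second});
    return "@values " . ($found ? 'match' : "don't match");
}

# Runs only when two file names are given: reference file, then file to check
if (@ARGV == 2) {
    my ($reference_file, $query_file) = @ARGV;

    open my $ref_fh, '<', $reference_file
        or die "Cannot open $reference_file: $!\n";
    my $seen = build_index($ref_fh);
    close $ref_fh;

    open my $query_fh, '<', $query_file
        or die "Cannot open $query_file: $!\n";
    while (my $line = <$query_fh>) {
        my $result = classify($seen, $line);
        print "$result\n" if defined $result;
    }
    close $query_fh;
}
```

Saved under a name of your choosing (say compare.pl) and run as `perl compare.pl input1.txt input2.txt`, this should reproduce the output in the question for the sample files. Only input1.txt is held in memory; input2.txt is streamed line by line.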