Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 876797
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T11:29:33+00:00 2026-05-15T11:29:33+00:00

Edit: solution added. Hi, I currently have some working albeit slow code. It merges

  • 0

Edit: solution added.

Hi, I currently have some working albeit slow code.

It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:

"one,two,,four,42"

and file 2 has this line;

"one,,three,,42"

where in 0 indexed $position = 4 has the primary key = 42;

then the sub: merge_file($file1,$file2,$outputfile,$position);

will output a file with the line:

"one,two,three,four,42";

Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)

There are about 1 million lines in each file.

Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.

Then I go through every line in the second file, and check if the primary key is in the hash, and if it is, get the line from the file1array and then add the columns I need from the first array to the second array, and then concat. to the end. Then delete the hash, and then at the very end, dump the entire thing to file. (I am using a SSD so I want to minimise file writes.)

It is probably best explained with a code:

sub merge_file2{
 my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
 print "merging: \n$file1 and \n$file2, to: \n$out\n";
 my $OUTSTRING = undef;

 my %line_for;
 my @file1array;
 open FILE1, "<$file1";
 print "$file1 opened\n";
 while (<FILE1>){
      chomp;
      $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
      $file1array[$.] = $_; #store line in file1array.
 }
 close FILE1;
 print "$file2 opened - merging..\n";
 open FILE2, "<", $file2;
 my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
 while (<FILE2>){
      print "$.\n" if ($.%1000) == 0;
      chomp;
      my @array1 = ();
      my @array2 = ();
      my @array2 = split /,/, $_; #split 2nd csv line by commas

      my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
      #                            ^         ^                  ^
      # prev line  lookup line in 1st file,lookup hash,     pos of key
      #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)

      foreach(@from1to2){
           $array2[$_] = $array1[$_];
      }
      my $outstring = join ",", @array2;
      $OUTSTRING.=$outstring."\n";
      delete $line_for{$array2[$position]};
 }
 close FILE2;
 print "adding rest of lines\n";
 foreach my $key (sort { $a <=> $b } keys %line_for){
      $OUTSTRING.= $file1array[$line_for{$key}]."\n";
 }

 print "writing file $out\n\n\n";
 write_line($out,$OUTSTRING);
}

The first while is fine, takes less than 1 minute, however the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think it is possible for a lot of speedup? 🙂 Thanks in advance.


Solution:

sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;

my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
    if ($.==1){
        $header = $_;
        next;
    }
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file1, [split ',', $_];
}
close FILE1;

open FILE2, "<$file2" or die;
while (<FILE2>){
    next if $.==1;
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file2, [split ',', $_];
}
close FILE2;

print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;   
print "sorted\n";
@file1 = undef;
@file2 = undef;
#foreach my $line (@file1){print "\t [ @$line ],\n";    }

my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
    my $key1 = $sortedf1[$i][$position];
    my $key2 = $sortedf2[$j][$position];
    if ($key1 eq $key2){
        foreach(0..$hsize){ #header size.
            $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
        }
        $i++;
        $j++;
    }
    elsif ( $key1 < $key2){
        push(@sortedf2,[@{$sortedf1[$i]}]);
        $i++;
    }
    elsif ( $key1 > $key2){ 
        $j++;
    }
}

#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(@sortedf2){
    print OUT (join ",", @{$_})."\n";
}
close OUT;

}

Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! 🙂

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T11:29:34+00:00Added an answer on May 15, 2026 at 11:29 am

    Two techniques come to mind.

    1. Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.

    2. First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Intro: EDIT: See solution at the bottom of this question (c++) I have a
Edit: I've got the solution and have described it a bit more at the
I've been having an issue with URL Rewriting and Postbacks. Edit: Currently using IIS
I have some problems adding my shader-glsl files from one xcode project to another.
I'm currently using jenkins/hudson for continuous integration a large mostly C++ project. We have
I'm currently working on an Intranet application project, using ASP.NET MVC 3. One of
I am using JMeter to do some load testing on a SOAP webservice. Currently
EDIT: Database names have been modified for simplicity I'm trying to get some dynamic
EDIT : proper solution: void add(Student s) { if(root == null) root = new
Solution Edit: Turns out you can't use the PHP SDK to return the correct

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.