I have an input file with the following format:
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to sum column 2 for each key (column 1), so that the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
- Assume that the input file is sorted
- The input file is pretty large (about 1M rows), so I don’t want to use an array and take up memory
- Each input line should be processed as we read it, and move to the next line
- I need to write the results to an outFile
- I need to do this in Perl, but pseudocode or an algorithm would help just as well
Thanks!
This is what I came up with. I would like to know whether it can be written better or more elegantly.
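use strict;
use warnings;

# The file names are placeholders; substitute your own.
open(my $in,  '<', 'infile.txt')  or die "Cannot open infile.txt: $!";
open(my $out, '>', 'outfile.txt') or die "Cannot open outfile.txt: $!";

my ($prev_key, $sum);
while (my $line = <$in>) {
    chomp $line;
    next unless length $line;               # skip blank lines
    my ($key, $val) = split /,/, $line;
    if (defined $prev_key && $key eq $prev_key) {
        $sum += $val;                       # same key: keep accumulating
    }
    else {
        # The key changed: write out the finished group, then start a new one.
        print {$out} "$prev_key,$sum\n" if defined $prev_key;
        ($prev_key, $sum) = ($key, $val);
    }
}
# Write out the last group, which no later key change will trigger.
print {$out} "$prev_key,$sum\n" if defined $prev_key;

close $in;
close $out or die "Cannot close outfile.txt: $!";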
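For comparison, if memory were not a concern, a hash keyed on the first column would do the job in a few lines. The following is only a minimal sketch under that assumption: the subroutine name sum_with_hash, the variable names $animal and $count, and the choice to print keys in sorted order are illustrative, and the input is assumed to be well-formed key,count lines.

```perl
use strict;
use warnings;

# Sum the second column per key, keeping one running total per distinct key.
# sum_with_hash is an illustrative name, not part of any library.
sub sum_with_hash {
    my ($in, $out) = @_;
    my %a;                                  # key => running total
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;           # skip blank lines
        my ($animal, $count) = split /,/, $line;
        $a{$animal} += $count;              # a missing entry starts out as zero
    }
    print {$out} "$_,$a{$_}\n" for sort keys %a;
    return;
}

# Demonstration on the sample data from the question, read from an
# in-memory filehandle so that no files are needed.
my $sample = "ant,1\nbat,1\nbat,2\ncat,4\ncat,1\ncat,2\ndog,4\n";
open my $sample_in, '<', \$sample or die "Cannot open in-memory input: $!";
sum_with_hash($sample_in, \*STDOUT);        # prints ant,1 bat,3 cat,7 dog,4, one per line
```

Memory use grows with the number of distinct keys, not with the number of rows, so 1M rows spread over a modest number of keys is unlikely to be a problem.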
This short code affords you the chance to learn Perl’s excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.

Observe incidentally that the code exercises behavior often described as Perl’s autovivification. The first time a particular animal is encountered in the input stream, no count exists, so Perl creates the entry and treats the missing count as zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.

On the other hand, it may happen that not only the number of rows but also the number of distinct animals is so large that one would not like to store the hash %a. One can still calculate totals in that case, provided only that the data are sorted by animal in the input, as they are in your example. Something like the following might suit (though regrettably it is not nearly so neat as the above).
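A minimal sketch of that streaming approach, assuming all rows for a given animal are adjacent in the input. The subroutine name sum_sorted and the variable names are illustrative; only one key and one running total are held in memory at any time.

```perl
use strict;
use warnings;

# Sum the second column per key for input in which equal keys are adjacent.
# sum_sorted is an illustrative name, not part of any library.
sub sum_sorted {
    my ($in, $out) = @_;
    my ($current, $total) = (undef, 0);
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;               # skip blank lines
        my ($animal, $count) = split /,/, $line;
        if (defined $current && $animal ne $current) {
            print {$out} "$current,$total\n";   # the previous group is complete
            $total = 0;
        }
        $current = $animal;
        $total  += $count;
    }
    # The last group is never followed by a key change, so flush it here.
    print {$out} "$current,$total\n" if defined $current;
    return;
}

# Demonstration on the sample data from the question.
my $rows = "ant,1\nbat,1\nbat,2\ncat,4\ncat,1\ncat,2\ndog,4\n";
open my $rows_in, '<', \$rows or die "Cannot open in-memory input: $!";
sum_sorted($rows_in, \*STDOUT);                 # prints ant,1 bat,3 cat,7 dog,4, one per line
```

Called as sum_sorted(\*STDIN, \*STDOUT), the same subroutine works as a filter. If the input turns out not to be sorted, a key that reappears later is reported as a separate group rather than merged, so the sortedness assumption matters.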