I have a text file (basically an error log with date, timestamp and some data) in the following pattern:
mm/dd/yy 12:00:00:0001
This is line 1
This is line 2
mm/dd/yy 12:00:00:0004
This is line 3
This is line 4
This is line 5
mm/dd/yy 12:00:00:0004
This is line 6
This is line 7
I’m new to Perl and need to write a script that searches the file for timestamps and merges the data lines that share the same timestamp.
I’m expecting the following output for the above sample.
mm/dd/yy 12:00:00:0001
This is line 1
This is line 2
mm/dd/yy 12:00:00:0004
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7
What’s the best way to get this done?
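I’ve had to do this task before on some very large files where the timestamps did not come in order, and I didn’t want to store it all in memory. I accomplished the task with a three-pass solution: tag every data line with its timestamp and write it to a temporary file, sort that file with an external sorter, then read the sorted file and turn it back into the starting format.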
This was fast enough for my task, since I could let it run while I went for a cup of coffee, but you might need something fancier if you need the results really quickly.
use strict;
use warnings;

use File::Temp qw(tempfile);

my( $temp_fh, $temp_filename ) = tempfile( UNLINK => 1 );

# read each line, tag with timestamp, and write to temp file
# will sort and undo later.
my $current_timestamp = '';
LINE: while( <DATA> )
    {
    chomp;
    if( m|^\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d$| ) # timestamp line
        {
        $current_timestamp = $_;
        next LINE;
        }
    elsif( m|\S| ) # line with non-whitespace (not a "blank line")
        {
        print $temp_fh "[$current_timestamp] $_\n";
        }
    else # blank lines
        {
        next LINE;
        }
    }

close $temp_fh;

# sort the file by lines using some very fast sorter
# (a plain line sort also orders the data lines within each timestamp)
system( "sort", qw(-o sorted.txt), $temp_filename ) == 0
    or die "sort failed: $?";

# read the sorted file and turn back into starting format
open my($in), "<", 'sorted.txt' or die "Could not read sorted.txt: $!";

$current_timestamp = '';
while( <$in> )
    {
    my( $timestamp, $line ) = m/\[(.*?)] (.*)/;
    if( $timestamp ne $current_timestamp )
        {
        $current_timestamp = $timestamp;
        print $/, $timestamp, $/;
        }
    print $line, $/;
    }

close $in;

# the original passed an undeclared $temp_file here, which fails under strict
unlink $temp_filename, 'sorted.txt';

__END__
01/01/70 12:00:00:0004
This is line 3
This is line 4
This is line 5
01/01/70 12:00:00:0001
This is line 1
This is line 2
01/01/70 12:00:00:0004
This is line 6
This is line 7
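If the log is small enough to fit in memory, a simpler alternative to the external sort is a hash of arrays keyed by timestamp. The following is only a minimal sketch under that assumption, not the approach used in this answer; the helper name merge_by_timestamp and the inline sample are made up for illustration. It keeps timestamps in the order they first appear and keeps the data lines under each timestamp in their original order.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): read a log from a
# filehandle and return the merged text, one block per distinct timestamp.
sub merge_by_timestamp {
    my ($fh) = @_;
    my %lines_for;    # timestamp => array ref of data lines
    my @order;        # timestamps in first-seen order
    my $current;

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ m|^\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d$| ) {
            # timestamp line: start (or resume) the block for this timestamp
            $current = $line;
            unless ( exists $lines_for{$current} ) {
                push @order, $current;
                $lines_for{$current} = [];
            }
        }
        elsif ( defined $current and $line =~ /\S/ ) {
            # non-blank data line: file it under the current timestamp
            push @{ $lines_for{$current} }, $line;
        }
    }

    return join '', map { "$_\n" } map { ( $_, @{ $lines_for{$_} } ) } @order;
}

# Sample input in the same shape as the question, with a placeholder date.
my $sample = <<'END';
01/01/70 12:00:00:0001
This is line 1
This is line 2
01/01/70 12:00:00:0004
This is line 3
This is line 4
This is line 5
01/01/70 12:00:00:0004
This is line 6
This is line 7
END

# Open the string as an in-memory filehandle so the sketch is self-contained.
open my $sample_fh, '<', \$sample or die "Could not open in-memory sample: $!";
print merge_by_timestamp($sample_fh);
close $sample_fh;
```

To run it against a real file, open that file instead of the in-memory string and pass the handle to the helper. Memory use grows with the size of the log, which is exactly what the three-pass version avoids.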