Although this is pretty basic, I can’t find a similar question, so please link to one if you know of an existing question/solution on SO.
I have a .txt file that is about 2MB and about 16,000 lines long. Each record is 160 characters long, with a blocking factor of 10. This is an older type of data structure that almost looks like a tab-delimited file, but the fields sit at fixed positions and are separated by single characters or whitespace.
First, I glob a directory for .txt files – there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.
my $txt_file = glob "/some/cheese/dir/*.txt";
Then I open the file with this line:
open (F, $txt_file) || die ("Could not open $txt_file: $!");
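A slightly more defensive variant of these two steps, as a minimal sketch: it calls glob in list context so the "exactly one file" assumption can be checked, and it uses a three-argument open with a lexical filehandle. The temporary directory, the file name records.txt, and the variable names are hypothetical stand-ins for /some/cheese/dir so that the sketch runs anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for /some/cheese/dir; removed again at exit.
my $dir = tempdir(CLEANUP => 1);
open my $seed, '>', "$dir/records.txt" or die "Could not create sample file: $!";
print {$seed} "sample record\n";
close $seed or die "Could not close sample file: $!";

# In list context glob returns every match, so the count can be verified.
my @txt_files = glob "$dir/*.txt";
die "Expected exactly one .txt file, found " . scalar(@txt_files)
    unless @txt_files == 1;
my $txt_file = $txt_files[0];

# Three-argument open with a lexical filehandle; $! says why open failed.
open my $fh, '<', $txt_file or die "Could not open $txt_file: $!";
my $line_count = 0;
$line_count++ while <$fh>;
close $fh;

print "$txt_file has $line_count line(s)\n";
```

The glob itself is unlikely to matter for speed, since it runs once per file rather than once per record.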
As per the data dictionary for this file, I’m parsing each “field” out of each line using Perl’s substr() function within a while loop.
while ($line = <F>)
{
    $nom_stat = substr($line,0,1);
    $lname = substr($line,1,15);
    $fname = substr($line,16,15);
    $mname = substr($line,31,1);
    $address = substr($line,32,30);
    $city = substr($line,62,20);
    $st = substr($line,82,2);
    $zip = substr($line,84,5);
    $lnum = substr($line,93,9);
    $cl_rank = substr($line,108,4);
    $ceeb = substr($line,112,6);
    $county = substr($line,118,2);
    $sex = substr($line,120,1);
    $grant_type = substr($line,121,1);
    $int_major = substr($line,122,3);
    $acad_idx = substr($line,125,3);
    $gpa = substr($line,128,5);
    $hs_cl_size = substr($line,135,4);
}
This approach takes a lot of time to process each line and I’m wondering if there is a more efficient way of getting each field out of each line of the file.
Can anyone suggest a more efficient/preferred method?
A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:
Input was a file with 20k lines; each line had the same 160 characters on it (16 repetitions of the characters 0123456789). So it’s the same input size as the data you’re working with.
The Benchmark::cmpthese() method outputs the subroutine calls from slowest to fastest. The first column tells us how many times per second the subroutine can be run. The regular expression approach is fastest, not unpack as I stated previously. Sorry about that.
The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.
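A minimal sketch of how such a three-way comparison might be set up, assuming the field offsets from the substr() calls in the question and a synthetic record like the one described above. The sub names, the unpack template, the precompiled qr// pattern, and the iteration count are illustrative choices, not the original benchmark code, and the relative timings will depend on the Perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic 160-character record: 16 repetitions of "0123456789".
my $line = '0123456789' x 16;

# Offsets and lengths copied from the substr() calls in the question.
sub by_substr {
    return (
        substr($line, 0, 1),   substr($line, 1, 15),  substr($line, 16, 15),
        substr($line, 31, 1),  substr($line, 32, 30), substr($line, 62, 20),
        substr($line, 82, 2),  substr($line, 84, 5),  substr($line, 93, 9),
        substr($line, 108, 4), substr($line, 112, 6), substr($line, 118, 2),
        substr($line, 120, 1), substr($line, 121, 1), substr($line, 122, 3),
        substr($line, 125, 3), substr($line, 128, 5), substr($line, 135, 4),
    );
}

# "a" returns each field verbatim (no trimming); "x" skips unused bytes.
my $template = 'a1 a15 a15 a1 a30 a20 a2 a5 x4 a9 x6 a4 a6 a2 a1 a1 a3 a3 a5 x2 a4';
sub by_unpack { return unpack $template, $line }

# One anchored pattern, compiled once with qr//; the uncaptured .{N}
# pieces skip the columns the question does not read.
my $pattern = qr/^(.)(.{15})(.{15})(.)(.{30})(.{20})(.{2})(.{5}).{4}(.{9}).{6}
                 (.{4})(.{6})(.{2})(.)(.)(.{3})(.{3})(.{5}).{2}(.{4})/x;
sub by_regex { return $line =~ $pattern }

# Sanity check: all three approaches must agree before they are timed.
my @expected = by_substr();
for my $extract (\&by_unpack, \&by_regex) {
    my @got = $extract->();
    die "field mismatch" unless "@got" eq "@expected";
}

# Each sub is called in list context so every field is actually extracted.
cmpthese(100_000, {
    substr => sub { my @f = by_substr() },
    unpack => sub { my @f = by_unpack() },
    regex  => sub { my @f = by_regex() },
});
```

On real records, switching the template from "a" to "A" would also strip the trailing spaces that pad each field, which substr() and the regular expression leave in place.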