Although this is pretty basic, I can’t find a similar question, so please link

Asked: May 20, 2026 (2026-05-20T10:40:28+00:00)


Although this is pretty basic, I can’t find a similar question, so please link to one if you know of an existing question/solution on SO.

I have a .txt file that is about 2MB and about 16,000 lines long. Each record length is 160 characters with a blocking factor of 10. This is an older type of data structure which almost looks like a tab-delimited file, but the separation is by single-chars/white-spaces.

First, I glob a directory for .txt files – there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");
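
For reference, here is a minimal sketch of the same two steps written with list-context glob and a three-argument open on a lexical filehandle. The temporary directory and the records.txt file name are stand-ins invented for this sketch so that it runs anywhere; they are not part of the real setup.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for /some/cheese/dir so the sketch is self-contained.
my $dir = tempdir(CLEANUP => 1);
open(my $out, '>', "$dir/records.txt") or die "Could not create sample file: $!";
print {$out} "sample record\n";
close($out);

# In list context glob returns all matches at once; keep only the first.
my ($txt_file) = glob("$dir/*.txt");
defined $txt_file or die "No .txt file found in $dir";

# Three-argument open with a lexical handle, reporting the OS error in $!.
open(my $fh, '<', $txt_file) or die "Could not open $txt_file: $!";
my $first_line = <$fh>;
close($fh);

print $first_line;
```

Assigning to a one-element list avoids the iterator behaviour glob has in scalar context, which only matters if the same glob expression is evaluated more than once.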

As per the data dictionary for this file, I’m parsing each “field” out of each line using Perl’s substr() function within a while loop.

while ($line = <F>)
{
$nom_stat   = substr($line,0,1);
$lname      = substr($line,1,15);
$fname      = substr($line,16,15);
$mname      = substr($line,31,1);
$address    = substr($line,32,30);
$city       = substr($line,62,20);
$st         = substr($line,82,2);
$zip        = substr($line,84,5);
$lnum       = substr($line,93,9);
$cl_rank    = substr($line,108,4);
$ceeb       = substr($line,112,6);
$county     = substr($line,118,2);
$sex        = substr($line,120,1);
$grant_type = substr($line,121,1);
$int_major  = substr($line,122,3);
$acad_idx   = substr($line,125,3);
$gpa        = substr($line,128,5);
$hs_cl_size = substr($line,135,4);
}

This approach takes a lot of time to process each line and I’m wondering if there is a more efficient way of getting each field out of each line of the file.

Can anyone suggest a more efficient/preferred method?

1 Answer

Editorial Team · Answer 1 · 2026-05-20T10:40:28+00:00

A single regular expression, compiled and cached using the /o option, was the fastest of the three approaches I tested. I ran your code three ways using the Benchmark module and came out with:

         Rate unpack substr regexp
 unpack 2.59/s     --   -59%   -67%
 substr 6.23/s   141%     --   -21%
 regexp 7.90/s   206%    27%     --

Input was a file with 20k lines, each containing the same 160 characters (16 repetitions of the characters 0123456789). So it's roughly the same input size as the data you're working with.

Benchmark::cmpthese() lists the subroutines from slowest to fastest. The Rate column shows how many times per second each subroutine can be run, and the remaining columns show the percentage difference against each of the other subroutines. The regular expression approach is fastest, not unpack as I stated previously. Sorry about that.
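
As a side note, the same single-pattern idea can be written with a precompiled qr// object instead of the /o flag. This is only a sketch and was not part of the benchmark above, so I can't say how it would rank. The @widths list is copied from the pattern in the benchmark code; the variable names are my own.

```perl
use strict;
use warnings;

# Field widths copied from the pattern used in the benchmark.
my @widths = (1, 15, 15, 1, 30, 20, 2, 5, 9, 4, 6, 2, 1, 1, 3, 3, 5, 4);

# Build the pattern text once, then compile it into a reusable regex object.
my $pattern_str = '^' . join('', map { "(.{$_})" } @widths);
my $pattern     = qr/$pattern_str/;

# Same synthetic record as the benchmark input: 160 characters plus newline.
my $line = ('0123456789' x 16) . "\n";

# In list context a successful match returns the captured groups directly.
my @fields = $line =~ $pattern;

print scalar(@fields), " fields captured, lname = $fields[1]\n";
```

Taking the captures as a list also avoids spelling out $1 through $18 by hand. Like the benchmarked pattern, this one captures the fields back to back and does not skip the unused columns that the substr() offsets jump over.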

The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.

#!/usr/bin/env perl
use Benchmark qw(:all);
use strict;

sub use_substr() {
    print "use_substr(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = (substr($line,0,1),
                           substr($line,1,15),
                           substr($line,16,15),
                           substr($line,31,1),
                           substr($line,32,30),
                           substr($line,62,20),
                           substr($line,82,2),
                           substr($line,84,5),
                           substr($line,93,9),
                           substr($line,108,4),
                           substr($line,112,6),
                           substr($line,118,2),
                           substr($line,120,1),
                           substr($line,121,1),
                           substr($line,122,3),
                           substr($line,125,3),
                           substr($line,128,5),
                           substr($line,135,4));
       #print "use_substr(): \$lname = $lname\n";
       #print "use_substr(): \$gpa   = $gpa\n";
    }    
    close(F);
    return 1;
}

sub use_regexp() {
    print "use_regexp(): New iteration\n";
    my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        if ( $line =~ m/$pattern/o ) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
               $hs_cl_size) = ( $1,
                                $2,
                                $3,
                                $4,
                                $5,
                                $6,
                                $7,
                                $8,
                                $9,
                                $10,
                                $11,
                                $12,
                                $13,
                                $14,
                                $15,
                                $16,
                                $17,
                                $18);
            #print "use_regexp(): \$lname = $lname\n";
            #print "use_regexp(): \$gpa   = $gpa\n";
        }
    }    
    close(F);
    return 1;
}

sub use_unpack() {
    print "use_unpack(): New iteration\n";
    open(F, "<data.txt") or die $!;
    while (my $line = <F>) {
        my($nom_stat, 
           $lname,   
           $fname,      
           $mname,    
           $address,     
           $city,    
           $st,       
           $zip,         
           $lnum,        
           $cl_rank,
           $ceeb,    
           $county,
           $sex,     
           $grant_type,
           $int_major, 
           $acad_idx,  
           $gpa,   
           $hs_cl_size) = unpack(
               "(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line
               );
        #print "use_unpack(): \$lname = $lname\n";
        #print "use_unpack(): \$gpa   = $gpa\n";
    }
    close(F);
    return 1;
}

# Benchmark it
my $itt = 50;
cmpthese($itt, {
        'substr' => sub { use_substr(); },
        'regexp' => sub { use_regexp(); },
        'unpack' => sub { use_unpack(); },
    }
);
exit(0)
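
One thing to watch: the substr() version skips a few unused columns (89-92, 102-107 and 133-134), while the regexp and unpack templates above read the fields back to back. If you want unpack to return exactly the same fields as the substr() offsets, a template along these lines should do it. This is a sketch only and was not benchmarked; the xN entries, which skip N bytes, are my addition.

```perl
use strict;
use warnings;

# Template mirroring the substr() offsets; xN skips N unused bytes.
# Note that 'A' trims trailing spaces from each field, unlike substr().
my $template = 'A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4';

# Same synthetic record as the benchmark input: 160 characters plus newline.
my $line = ('0123456789' x 16) . "\n";

my ($nom_stat, $lname, $fname, $mname, $address, $city, $st, $zip,
    $lnum, $cl_rank, $ceeb, $county, $sex, $grant_type, $int_major,
    $acad_idx, $gpa, $hs_cl_size) = unpack($template, $line);

print "lname      = $lname\n";
print "lnum       = $lnum\n";
print "gpa        = $gpa\n";
print "hs_cl_size = $hs_cl_size\n";
```

If trailing spaces matter, lowercase 'a' returns each field untrimmed. Also be aware that unpack dies when an x skip runs past the end of a short line, so truncated records need to be checked first.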
