The Archive Base

Editorial Team
Asked: May 20, 2026

Although this is pretty basic, I can’t find a similar question, so please link to one if you know of an existing question/solution on SO.


I have a .txt file that is about 2MB and about 16,000 lines long. Each record is 160 characters long, with a blocking factor of 10. This is an older type of data structure that almost looks like a tab-delimited file, but the fields are separated by single characters/whitespace.

First, I glob a directory for .txt files – there is never more than one file in the directory at a time, so this attempt may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");

As per the data dictionary for this file, I’m parsing each “field” out of each line using Perl’s substr() function within a while loop.

while ($line = <F>)
{
    $nom_stat   = substr($line,0,1);
    $lname      = substr($line,1,15);
    $fname      = substr($line,16,15);
    $mname      = substr($line,31,1);
    $address    = substr($line,32,30);
    $city       = substr($line,62,20);
    $st         = substr($line,82,2);
    $zip        = substr($line,84,5);
    $lnum       = substr($line,93,9);
    $cl_rank    = substr($line,108,4);
    $ceeb       = substr($line,112,6);
    $county     = substr($line,118,2);
    $sex        = substr($line,120,1);
    $grant_type = substr($line,121,1);
    $int_major  = substr($line,122,3);
    $acad_idx   = substr($line,125,3);
    $gpa        = substr($line,128,5);
    $hs_cl_size = substr($line,135,4);
}

This approach takes a lot of time to process each line and I’m wondering if there is a more efficient way of getting each field out of each line of the file.

Can anyone suggest a more efficient/preferred method?

1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 10:40 am

    A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module and came out with:

             Rate unpack substr regexp
     unpack 2.59/s     --   -59%   -67%
     substr 6.23/s   141%     --   -21%
     regexp 7.90/s   206%    27%     --
    

    Input was a file with 20k lines; each line held the same 160 characters (16 repetitions of the characters 0123456789). So it is roughly the same input size as the data you're working with.
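For anyone who wants to reproduce that input, here is a minimal sketch that writes such a file. It assumes "20k lines" means exactly 20,000 lines and reuses the data.txt name that the benchmark subs open; adjust both if your setup differs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# One 160-character record: "0123456789" repeated 16 times.
my $record = '0123456789' x 16;

# Assumption: "20k lines" is taken to mean exactly 20,000 lines.
my $line_count = 20_000;

# data.txt is the file name the benchmark subs read from.
open(my $out, '>', 'data.txt') or die "Could not write data.txt: $!";
print {$out} "$record\n" for 1 .. $line_count;
close($out) or die "Could not close data.txt: $!";
```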

    The Benchmark::cmpthese() function lists the subroutines from slowest to fastest. The first column shows how many times per second each subroutine can be run. The regular expression approach is the fastest, not unpack as I stated previously. Sorry about that.
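To get a feel for how such a table is produced and read, here is a small, self-contained sketch. The two subs are illustrative stand-ins (they only extract the second field), not the benchmark used for the numbers above, and the rates you see will depend on your machine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same shape of line as the test input: 160 characters.
my $line = '0123456789' x 16;

# A negative count means "run each sub for at least that many CPU seconds".
# cmpthese() prints the comparison table and also returns its rows.
my $rows = cmpthese(-1, {
    'substr' => sub { my $f   = substr($line, 1, 15); },
    'unpack' => sub { my ($f) = unpack('x1 A15', $line); },
});
```

Each row of the printed table is one sub; the Rate column is runs per second, and the remaining columns give the percentage difference against the sub named in the column header.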

    The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.

    #!/usr/bin/env perl
    use Benchmark qw(:all);
    use strict;
    
    sub use_substr() {
        print "use_substr(): New iteration\n";
        open(F, "<data.txt") or die $!;
        while (my $line = <F>) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
               $hs_cl_size) = (substr($line,0,1),
                               substr($line,1,15),
                               substr($line,16,15),
                               substr($line,31,1),
                               substr($line,32,30),
                               substr($line,62,20),
                               substr($line,82,2),
                               substr($line,84,5),
                               substr($line,93,9),
                               substr($line,108,4),
                               substr($line,112,6),
                               substr($line,118,2),
                               substr($line,120,1),
                               substr($line,121,1),
                               substr($line,122,3),
                               substr($line,125,3),
                               substr($line,128,5),
                               substr($line,135,4));
           #print "use_substr(): \$lname = $lname\n";
           #print "use_substr(): \$gpa   = $gpa\n";
        }    
        close(F);
        return 1;
    }
    
    sub use_regexp() {
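        print "use_regexp(): New iteration\n";
        # Note: this pattern reads the 18 fields back to back, while the substr()
        # calls skip offsets 89-92, 102-107 and 133-134, so the captures from
        # $lnum onward are not the same slices as in use_substr().
        my $pattern = '^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})(.{9})(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5})(.{4})';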
        open(F, "<data.txt") or die $!;
        while (my $line = <F>) {
            if ( $line =~ m/$pattern/o ) {
                my($nom_stat, 
                   $lname,   
                   $fname,      
                   $mname,    
                   $address,     
                   $city,    
                   $st,       
                   $zip,         
                   $lnum,        
                   $cl_rank,
                   $ceeb,    
                   $county,
                   $sex,     
                   $grant_type,
                   $int_major, 
                   $acad_idx,  
                   $gpa,   
                   $hs_cl_size) = ( $1,
                                    $2,
                                    $3,
                                    $4,
                                    $5,
                                    $6,
                                    $7,
                                    $8,
                                    $9,
                                    $10,
                                    $11,
                                    $12,
                                    $13,
                                    $14,
                                    $15,
                                    $16,
                                    $17,
                                    $18);
                #print "use_regexp(): \$lname = $lname\n";
                #print "use_regexp(): \$gpa   = $gpa\n";
            }
        }    
        close(F);
        return 1;
    }
    
    sub use_unpack() {
        print "use_unpack(): New iteration\n";
        open(F, "<data.txt") or die $!;
        while (my $line = <F>) {
            my($nom_stat, 
               $lname,   
               $fname,      
               $mname,    
               $address,     
               $city,    
               $st,       
               $zip,         
               $lnum,        
               $cl_rank,
               $ceeb,    
               $county,
               $sex,     
               $grant_type,
               $int_major, 
               $acad_idx,  
               $gpa,   
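               $hs_cl_size) = unpack(
                   # Note: like the regexp, this template treats the fields as
                   # contiguous and does not skip the gaps the substr() calls skip.
                   "(A1)(A15)(A15)(A1)(A30)(A20)(A2)(A5)(A9)(A4)(A6)(A2)(A1)(A1)(A3)(A3)(A5)(A4)(A*)", $line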
                   );
            #print "use_unpack(): \$lname = $lname\n";
            #print "use_unpack(): \$gpa   = $gpa\n";
        }
        close(F);
        return 1;
    }
    
    # Benchmark it
    my $itt = 50;
    cmpthese($itt, {
            'substr' => sub { use_substr(); },
            'regexp' => sub { use_regexp(); },
            'unpack' => sub { use_unpack(); },
        }
    );
    exit(0)
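The substr() offsets in the question leave gaps after $zip, $lnum and $gpa. As a rough sketch (not benchmarked, and not the code that produced the numbers above), the regexp and unpack variants can be generated from the same offset table so that all three methods return identical fields. The names by_substr, by_regexp and by_unpack are made up for this example. The pattern is compiled once with qr//, which serves the same purpose as /o here, and the unpack template uses a rather than A so trailing spaces are kept exactly as substr() returns them.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Field names in record order, as in the question.
my @names = qw(nom_stat lname fname mname address city st zip lnum cl_rank
               ceeb county sex grant_type int_major acad_idx gpa hs_cl_size);

# (offset, length) pairs copied from the question's substr() calls.
my @layout = ([0,1],[1,15],[16,15],[31,1],[32,30],[62,20],[82,2],[84,5],
              [93,9],[108,4],[112,6],[118,2],[120,1],[121,1],[122,3],
              [125,3],[128,5],[135,4]);

# Build the regexp source and the unpack template from the layout,
# emitting an explicit skip wherever a field does not start where
# the previous one ended.
my ($pos, $re_src, $tmpl) = (0, '^', '');
for my $field (@layout) {
    my ($off, $len) = @$field;
    my $gap = $off - $pos;
    if ($gap > 0) {
        $re_src .= ".{$gap}";   # uncaptured filler
        $tmpl   .= "x$gap ";    # skip $gap bytes
    }
    $re_src .= "(.{$len})";
    $tmpl   .= "a$len ";
    $pos = $off + $len;
}
my $re = qr/$re_src/;           # compiled once, reused for every line

# Hypothetical helper names; each returns the 18 fields as a list.
sub by_substr { my ($line) = @_; return map { substr($line, $_->[0], $_->[1]) } @layout }
sub by_regexp { my ($line) = @_; return $line =~ $re }
sub by_unpack { my ($line) = @_; return unpack($tmpl, $line) }

# Sanity check on a 160-character line of printable, non-space characters.
my $sample = join '', map { chr(33 + $_ % 90) } 0 .. 159;
my @s = by_substr($sample);
my @r = by_regexp($sample);
my @u = by_unpack($sample);

# Name the fields with a hash slice.
my %rec;
@rec{@names} = @u;

print "all three methods agree\n" if "@s" eq "@r" and "@s" eq "@u";
```

If you would rather have trailing padding stripped from each field, as the A format in the benchmark does, swap a for A in the generated template; the results will then differ from substr() on space-padded fields.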
    