The Archive Base Latest Questions

Editorial Team
Asked: June 13, 2026

I received a PDF file of tabular data that I’ve converted to plaintext for processing.

pdftotext -nopgbrk -layout file.pdf

This does a pretty decent job, but it uses spaces to delimit the fields in the columns and seems primarily interested in preserving the visual layout rather than the ‘structural’ layout, i.e., there is no consistent or reliable delimiter. So now I convert 2 or more spaces to tabs:

sed -i 's/[[:space:]]\{2,\}/\t/g' file.txt

Using cat -vte I see that this does a pretty nice job of placing tabs in the file. However, there are a few inconsistencies with the second field that I’d like to ask your help with.

Please see the following comparison for clarification:

Normal/Expected results:

79879   5.6     0.5     MG      EN      SQ      TFK World Report 09-24-2004     Time for Kids Editors,  ORD1915643
79880   5.5     0.5     MG      EN      SQ      TFK World Report 10-01-2004     Time for Kids Editors,  ORD1915643
79881   6.0     0.5     MG      EN      SQ      TFK World Report 10-08-2004     Time for Kids Editors,  ORD1915643
79882   5.5     0.5     MG      EN      SQ      TFK World Report 10-22-2004     Time for Kids Editors,  ORD1915643
79883   5.9     0.5     MG      EN      SQ      TFK World Report 10-29-2004     Time for Kids Editors,  ORD1915643

Some oddities and inconsistencies:

72      5.2 3.0 MG      EN      LS      Ramona and Her Father   Cleary, Beverly ORD2111460
491     4.8 4.0 MG      EN      LS      Ramona and Her Mother   Cleary, Beverly ORD1748201
134     5.6 3.0 MG      EN      LS      Ramona Quimby, Age 8    Cleary, Beverly ORD1748201
29      4.7     5.0 MG  EN      LS      From the Mixed-Up Files of Mrs. Basil E.        Konigsburg, E.L.        ORD1525579

Note that the ‘smushing’ effect may occur in either field 2 or field 3, and that the number of fields differs from the ‘normal’ results by either 1 or 2.
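One quick way to see how many lines are affected is to tally the number of tab-separated fields per line. This is only a minimal sketch: the helper name `field_counts` and the two sample rows are made up for illustration, and it assumes the file has already been through the space-to-tab conversion.

```shell
# Hypothetical helper: count how many lines have each tab-separated field count.
field_counts() {
    awk -F'\t' '{ count[NF]++ } END { for (n in count) print n, count[n] }' "$@" | sort -n
}

# Two illustrative rows: one well-formed (9 fields) and one where
# fields 2 and 3 are fused together by a single space (8 fields).
printf '79879\t5.6\t0.5\tMG\tEN\tSQ\tTFK World Report 09-24-2004\tTime for Kids Editors,\tORD1915643\n72\t5.2 3.0\tMG\tEN\tLS\tRamona and Her Father\tCleary, Beverly\tORD2111460\n' |
    field_counts
```

Each output line is a field count followed by the number of lines with that count, so any count other than the expected one points at rows that need repair.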

So, to solve this, I’ve tried stuff like the following:

awk -F'\t' 'OFS="\t";$1 ~ /^[[:digit:]]/{print $1,gensub(/[[:space:]]/,"\t","g",$2),$3,$4,$5,$6,$7}' file.txt

This seems to print each line (or at least most lines) twice, and it cuts off fields.
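The doubling most likely comes from the bare `OFS="\t";` at the start of the program. Outside of braces, awk treats it as a pattern with no action; the assignment evaluates to a non-empty string, which counts as true, so the default action prints every record unchanged before the second rule prints it again. The cut-off fields follow from printing only $1 through $7. A small sketch of the pitfall and the usual fix, with made-up function names and a toy input:

```shell
# Pitfall: a bare assignment is a pattern, and a true pattern with no action prints the record.
doubled() { awk 'OFS="\t"; { print $1, $2 }'; }

# Fix: set OFS once in a BEGIN block, so only the explicit print runs.
fixed() { awk 'BEGIN { OFS = "\t" } { print $1, $2 }'; }

printf 'a b c\n' | doubled   # two lines: the record as-is, then the first two fields
printf 'a b c\n' | fixed     # one line: the first two fields, tab-separated
```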

EDIT
This seems to be working …so far, still testing.

awk -F'\t' 'BEGIN {OFS="\t"}
            {$2 = gensub( /[[:space:]]/, "\t", "g", $2 );
             $3 = gensub( /[[:space:]]/, "\t", "g", $3 );
             print}' file.txt

Is there a simple way to solve this issue using awk?

UPDATE

Some have requested a sample representing the state just prior to my space-to-tab conversion. The following sample comes from near where the previous sample is in the document. It looks about the same, except that one [below] is spaced and the other [above] is tabbed. Note the way pdftotext deals with column 2 in the different samples below: sometimes splitting it, sometimes making a single column.

Sample 1:

    72   5.2 3.0 MG       EN   RP     Ramona and Her Father                     Cleary, Beverly              ORD0630871
are orphans
   491   4.8 4.0 MG       EN   RP     Ramona and Her Mother                     Cleary, Beverly              ORD0785414
are also orphans
   186   4.8 4.0 MG       EN   RP     Ramona Forever                            Cleary, Beverly              ORD0630871
forever the orphan

Sample 2:

  79871    5.7   0.5   MG   EN    SQ        TFK World Report 03-18-2005         Time for Kids Editors,       ORD1915643
  79872    5.8   0.5   MG   EN    SQ        TFK World Report 04-01-2005         Time for Kids Editors,       ORD1915643
  79873    6.0   0.5   MG   EN    SQ        TFK World Report 04-08-2005         Time for Kids Editors,       ORD1915643

UPDATE 2

Made the following changes to Ed’s submission. Thinking it could be simplified, but it works. It has to allow for the orphaned lines.

$1 ~ /^[[:digit:]]+/{
   for (i=1;i<=6;i++)
      printf "%s\t", $i

   n = split($0,tmp,/  +/)

   for (i=2;i>=0;i--)
      printf "%s\t", tmp[n-i]

   print ""
}
$1 ~ /^[^[:digit:]]+/ {print $0}

Maybe this is prettier:

{
        if ($1 ~ /^[[:digit:]]+/) {
                for (i=1;i<=6;i++)
                        printf "%s\t", $i

                n = split($0,tmp,/  +/)

                for (i=2;i>=0;i--)
                        printf "%s\t", tmp[n-i]

                print ""
        }
        else print $0;
}
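Both versions above leave a trailing tab on every data line. A possible variant that builds each output line as one string, so there is no trailing tab, and still passes the orphaned lines through untouched. This is only a sketch: the function name `to_tsv` is made up, and it assumes the raw pdftotext output where the last three columns are separated by at least two spaces and lines have no trailing spaces.

```shell
# Hypothetical helper: turn raw pdftotext rows into tab-separated rows.
to_tsv() {
    awk '
    $1 ~ /^[0-9]+$/ {
        out = $1
        for (i = 2; i <= 6; i++)            # first six fields split on any whitespace
            out = out "\t" $i
        n = split($0, tail, /  +/)          # runs of 2+ spaces delimit the last three columns
        for (i = n - 2; i <= n; i++)
            out = out "\t" tail[i]
        print out
        next
    }
    { print }                               # orphaned lines pass through unchanged
    ' "$@"
}

printf '    72   5.2 3.0 MG       EN   RP     Ramona and Her Father                     Cleary, Beverly              ORD0630871\nare orphans\n' | to_tsv
```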

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 5:33 pm

    Rather than starting from the output of a sed command that may be what is corrupting your data, post your data as it is BEFORE you run that sed command on it and let us go from there. Since you say the PDF conversion tool preserves the “visual layout”, I suspect the right solution is simply to use gawk’s FIELDWIDTHS capability, so that you parse the PDF converter’s output based on the width of the fields rather than trying to figure out how many spaces represent a field separator.
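To illustrate what width-based parsing means, here is a rough sketch that cuts each line at fixed column positions with substr(), which works in any POSIX awk; gawk's FIELDWIDTHS variable expresses the same idea declaratively. The column positions, the names `id`, `level`, `points` and `title`, and the sample row are all assumptions made up for the example, not taken from the real file.

```shell
# Hypothetical layout: id in columns 1-6, level in 7-12, points in 13-18, title from 19 on.
fixed_width() {
    awk '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    {
        id     = trim(substr($0,  1, 6))
        level  = trim(substr($0,  7, 6))
        points = trim(substr($0, 13, 6))
        title  = trim(substr($0, 19))
        print id "\t" level "\t" points "\t" title
    }' "$@"
}

# Build one row with known widths so the cut positions line up.
printf '%6s%6s%6s  %s\n' 72 5.2 3.0 'Ramona and Her Father' | fixed_width
```

This only works if every line of the converter output uses the same column positions, which is worth checking before relying on it.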

    EDIT: here’s a match()-based solution for comparison, but I actually now think @ghoti is right and the solution is simpler than this:

    $ cat file
        72   5.2 3.0 MG       EN   RP     Ramona and Her Father     Cleary, Beverly    ORD0630871
       491   4.8 4.0 MG       EN   RP     Ramona and Her Mother     Cleary, Beverly    ORD0785414
      79872  5.8  0.5  MG  EN   SQ    TFK World Report 04-01-2005  Time for Kids Editors,  ORD1915643
      79873  6.0  0.5  MG  EN   SQ    TFK World Report 04-08-2005  Time for Kids Editors,  ORD1915643
    $
    $ cat tst.awk
    BEGIN {
       whl = "([[:digit:]]+)"
       dec = "([[:digit:]]+[.][[:digit:]]+)"
       wrd = "([^ ]+)"
       rst = "(.*)"
       s   = "[ ]+"
       fmt = whl s dec s dec s wrd s wrd s wrd s rst
    }
    {
       match($0,fmt,arr)
       split(arr[7],tmp,/  +/)
       arr[7] = tmp[1]
       arr[8] = tmp[2]
       arr[9] = tmp[3]
    
       for (i=1;i<=9;i++)
          printf "<%s>", arr[i]
       print ""
    }
    $
    $ awk -f tst.awk file
    <72><5.2><3.0><MG><EN><RP><Ramona and Her Father><Cleary, Beverly><ORD0630871>
    <491><4.8><4.0><MG><EN><RP><Ramona and Her Mother><Cleary, Beverly><ORD0785414>
    <79872><5.8><0.5><MG><EN><SQ><TFK World Report 04-01-2005><Time for Kids Editors,><ORD1915643>
    <79873><6.0><0.5><MG><EN><SQ><TFK World Report 04-08-2005><Time for Kids Editors,><ORD1915643>
    

    EDIT: yeah, here’s a simpler solution, just print the first 6 fields and then split the rest on a multi-space separator:

    $ cat tst2.awk
    {
       for (i=1;i<=6;i++)
          printf "<%s>", $i
    
       n = split($0,tmp,/  +/)
    
       for (i=2;i>=0;i--)
          printf "<%s>", tmp[n-i]
    
       print ""
    }
    $
    $ awk -f tst2.awk file
    <72><5.2><3.0><MG><EN><RP><Ramona and Her Father><Cleary, Beverly><ORD0630871>
    <491><4.8><4.0><MG><EN><RP><Ramona and Her Mother><Cleary, Beverly><ORD0785414>
    <79872><5.8><0.5><MG><EN><SQ><TFK World Report 04-01-2005><Time for Kids Editors,><ORD1915643>
    <79873><6.0><0.5><MG><EN><SQ><TFK World Report 04-08-2005><Time for Kids Editors,><ORD1915643>
    
© 2021 The Archive Base. All Rights Reserved