I received a PDF file of tabular data that I’ve converted to plaintext for


Asked: June 13, 2026

I received a PDF file of tabular data that I’ve converted to plaintext for processing.

pdftotext -nopgbrk -layout file.pdf

This does a pretty decent job, but it uses spaces to separate the fields in each column and seems primarily interested in preserving the visual layout rather than the ‘structural’ layout, i.e., there is no consistent or reliable delimiter. So now I convert runs of two or more spaces to tabs:

sed -i 's/[[:space:]]\{2,\}/\t/g' file.txt

Using cat -vte I see that this does a pretty nice job of placing tabs in the file. However, there are a few inconsistencies in the second field that I’d like to ask your help with.

Please see the following comparison for clarification:

Normal/Expected results:

79879   5.6     0.5     MG      EN      SQ      TFK World Report 09-24-2004     Time for Kids Editors,  ORD1915643
79880   5.5     0.5     MG      EN      SQ      TFK World Report 10-01-2004     Time for Kids Editors,  ORD1915643
79881   6.0     0.5     MG      EN      SQ      TFK World Report 10-08-2004     Time for Kids Editors,  ORD1915643
79882   5.5     0.5     MG      EN      SQ      TFK World Report 10-22-2004     Time for Kids Editors,  ORD1915643
79883   5.9     0.5     MG      EN      SQ      TFK World Report 10-29-2004     Time for Kids Editors,  ORD1915643

Some oddities and inconsistencies:

72      5.2 3.0 MG      EN      LS      Ramona and Her Father   Cleary, Beverly ORD2111460
491     4.8 4.0 MG      EN      LS      Ramona and Her Mother   Cleary, Beverly ORD1748201
134     5.6 3.0 MG      EN      LS      Ramona Quimby, Age 8    Cleary, Beverly ORD1748201
29      4.7     5.0 MG  EN      LS      From the Mixed-Up Files of Mrs. Basil E.        Konigsburg, E.L.        ORD1525579

Note that the ‘smushing’ effect may occur in either field 2 or field 3, and that the number of fields differs from the ‘normal’ results by either 1 or 2.
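
One way to locate the affected rows is to count the tab-separated fields on each line and report the ones that do not have the expected nine. The following is only a sketch: the two rows are rebuilt from the samples above with explicit tabs, and the expected count of 9 is an assumption based on the ‘normal’ rows.

```shell
# Sample rebuilt from the rows above: one well-formed row (9 fields)
# and one "smushed" row where fields 2 and 3 share a single tab cell (8 fields).
sample=$(printf '79879\t5.6\t0.5\tMG\tEN\tSQ\tTFK World Report 09-24-2004\tTime for Kids Editors,\tORD1915643\n72\t5.2 3.0\tMG\tEN\tLS\tRamona and Her Father\tCleary, Beverly\tORD2111460\n')

# Report the line number and field count of every row that does not have 9 fields.
bad_rows=$(printf '%s\n' "$sample" | awk -F'\t' 'NF != 9 { print NR ": " NF " fields" }')
printf '%s\n' "$bad_rows"
```

Against the real file, the same awk program can be run directly on file.txt to see how many rows need repair before choosing a fix.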

So, to solve this I’ve tried things like the following:

awk -F'\t' 'OFS="\t";$1 ~ /^[[:digit:]]/{print $1,gensub(/[[:space:]]/,"\t","g",$2),$3,$4,$5,$6,$7}' file.txt

This seems to double each, or at least most, line(s) and cuts off fields.
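
A likely cause of the doubling: in awk, an expression placed outside braces is treated as a pattern, and a pattern with no action prints the line. `OFS="\t"` is an assignment whose value is a non-empty string, so it is always true and every line is printed once unchanged before the second rule prints it again. The fields are cut off because only $1 through $7 are printed while the rows have eight or nine fields. A minimal sketch with a made-up two-field line shows the difference between the stray pattern and a BEGIN block:

```shell
# Stray assignment used as a pattern: prints the whole line, then the action prints $1.
doubled=$(printf 'a\tb\n' | awk -F'\t' 'OFS="\t"; { print $1 }')

# Assignment moved into BEGIN: only the action produces output.
fixed=$(printf 'a\tb\n' | awk -F'\t' 'BEGIN { OFS = "\t" } { print $1 }')

printf 'doubled:\n%s\nfixed:\n%s\n' "$doubled" "$fixed"
```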

EDIT
This seems to be working so far; still testing.

awk -F'\t' '{$2 = gensub( /[[:space:]]/, "\t", "g", $2 );
             $3 = gensub( /[[:space:]]/, "\t", "g", $3 )}
             {OFS="\t";print}' file.txt

Is there a simple way to solve this issue using awk?

UPDATE

Some have requested a sample representing the state just prior to my space-to-tab conversion. The following sample comes from near the same place in the document as the previous one. It looks about the same, except that the one below is space-separated and the one above is tab-separated. Note the way pdftotext deals with column 2 in the different samples below: sometimes it splits the values, sometimes it merges them into a single column.

Sample 1:

    72   5.2 3.0 MG       EN   RP     Ramona and Her Father                     Cleary, Beverly              ORD0630871
are orphans
   491   4.8 4.0 MG       EN   RP     Ramona and Her Mother                     Cleary, Beverly              ORD0785414
are also orphans
   186   4.8 4.0 MG       EN   RP     Ramona Forever                            Cleary, Beverly              ORD0630871
forever the orphan

Sample 2:

  79871    5.7   0.5   MG   EN    SQ        TFK World Report 03-18-2005         Time for Kids Editors,       ORD1915643
  79872    5.8   0.5   MG   EN    SQ        TFK World Report 04-01-2005         Time for Kids Editors,       ORD1915643
  79873    6.0   0.5   MG   EN    SQ        TFK World Report 04-08-2005         Time for Kids Editors,       ORD1915643

UPDATE 2

Made the following changes to Ed’s submission. Thinking it could be simplified, but it works. It has to allow for the orphaned lines.

$1 ~ /^[[:digit:]]+/{
   for (i=1;i<=6;i++)
      printf "%s\t", $i

   n = split($0,tmp,/  +/)

   for (i=2;i>=0;i--)
      printf "%s\t", tmp[n-i]

   print ""
}
$1 ~ /^[^[:digit:]]+/ {print $0}

Maybe this is prettier:

{
        if ($1 ~ /^[[:digit:]]+/) {
                for (i=1;i<=6;i++)
                        printf "%s\t", $i

                n = split($0,tmp,/  +/)

                for (i=2;i>=0;i--)
                        printf "%s\t", tmp[n-i]

                print ""
        }
        else print $0;
}
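
A possible further tidy-up, offered only as a sketch: build each output row as a string so that no trailing tab is written, and pass orphan lines through untouched. It assumes, as the scripts above do, that every data row starts with six whitespace-separated fields and that the title, author and order number are separated by two or more spaces. `[0-9]` is used in place of `[[:digit:]]` only so that it also runs on older awk versions.

```shell
# Rows copied from Sample 1 and Sample 2 above, including one orphan line.
input=$(printf '%s\n' \
  '    72   5.2 3.0 MG       EN   RP     Ramona and Her Father                     Cleary, Beverly              ORD0630871' \
  'are orphans' \
  '  79871    5.7   0.5   MG   EN    SQ        TFK World Report 03-18-2005         Time for Kids Editors,       ORD1915643')

tsv=$(printf '%s\n' "$input" | awk '
$1 ~ /^[0-9]+$/ {
    line = $1
    for (i = 2; i <= 6; i++)          # first six fields split on any whitespace
        line = line "\t" $i
    n = split($0, tmp, /  +/)         # split the whole line on runs of 2+ spaces
    for (i = n - 2; i <= n; i++)      # last three chunks: title, author, order number
        line = line "\t" tmp[i]
    print line
    next
}
{ print }                             # orphan lines pass through unchanged
')
printf '%s\n' "$tsv"
```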


1 Answer

Editorial Team · Answer 1 · 2026-06-13T17:33:43+00:00

Rather than starting from the output of a sed command that may be what is corrupting your data, post your data as it is BEFORE you run that sed command on it and let us go from there. Since you say the PDF conversion tool preserves the “visual layout”, I suspect the right solution is simply to use gawk’s FIELDWIDTHS capability on that output, so that you parse the PDF converter’s output based on the width of the fields rather than trying to figure out how many spaces it takes to represent a field separator.
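
To illustrate the fixed-width idea, here is a rough sketch that cuts each line at fixed column positions with substr(), which works in any POSIX awk; gawk’s FIELDWIDTHS variable expresses the same thing declaratively. The rows and the column widths (7, 6, 6, 30, then the rest of the line) are invented for the example and are not measured from the real file. Note also that the two samples in the question do not line up on the same columns, so one set of widths may not fit every page.

```shell
# Hypothetical fixed-width rows, generated with printf so the widths are known:
# id (7 chars), level (6), points (6), title (30), order number (rest of line).
rows=$(printf '%7s%6s%6s  %-28s%s\n' \
  72    5.2 3.0 'Ramona and Her Father'       ORD0630871 \
  79871 5.7 0.5 'TFK World Report 03-18-2005' ORD1915643)

parsed=$(printf '%s\n' "$rows" | awk '
function trim(s) { gsub(/^ +| +$/, "", s); return s }   # strip padding spaces
{
    printf "%s\t%s\t%s\t%s\t%s\n",
        trim(substr($0, 1, 7)), trim(substr($0, 8, 6)), trim(substr($0, 14, 6)),
        trim(substr($0, 20, 30)), trim(substr($0, 50))
}')
printf '%s\n' "$parsed"
```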

EDIT: here’s a match()-based solution for comparison, but I actually now think @ghoti is right and the solution is simpler than this:

$ cat file
    72   5.2 3.0 MG       EN   RP     Ramona and Her Father     Cleary, Beverly    ORD0630871
   491   4.8 4.0 MG       EN   RP     Ramona and Her Mother     Cleary, Beverly    ORD0785414
  79872  5.8  0.5  MG  EN   SQ    TFK World Report 04-01-2005  Time for Kids Editors,  ORD1915643
  79873  6.0  0.5  MG  EN   SQ    TFK World Report 04-08-2005  Time for Kids Editors,  ORD1915643
$
$ cat tst.awk
BEGIN {
   whl = "([[:digit:]]+)"
   dec = "([[:digit:]]+[.][[:digit:]]+)"
   wrd = "([^ ]+)"
   rst = "(.*)"
   s   = "[ ]+"
   fmt = whl s dec s dec s wrd s wrd s wrd s rst
}
{
   match($0,fmt,arr)
   split(arr[7],tmp,/  +/)
   arr[7] = tmp[1]
   arr[8] = tmp[2]
   arr[9] = tmp[3]

   for (i=1;i<=9;i++)
      printf "<%s>", arr[i]
   print ""
}
$
$ awk -f tst.awk file
<72><5.2><3.0><MG><EN><RP><Ramona and Her Father><Cleary, Beverly><ORD0630871>
<491><4.8><4.0><MG><EN><RP><Ramona and Her Mother><Cleary, Beverly><ORD0785414>
<79872><5.8><0.5><MG><EN><SQ><TFK World Report 04-01-2005><Time for Kids Editors,><ORD1915643>
<79873><6.0><0.5><MG><EN><SQ><TFK World Report 04-08-2005><Time for Kids Editors,><ORD1915643>

EDIT: yeah, here’s a simpler solution, just print the first 6 fields and then split the rest on a multi-space separator:

$ cat tst2.awk
{
   for (i=1;i<=6;i++)
      printf "<%s>", $i

   n = split($0,tmp,/  +/)

   for (i=2;i>=0;i--)
      printf "<%s>", tmp[n-i]

   print ""
}
$
$ awk -f tst2.awk file
<72><5.2><3.0><MG><EN><RP><Ramona and Her Father><Cleary, Beverly><ORD0630871>
<491><4.8><4.0><MG><EN><RP><Ramona and Her Mother><Cleary, Beverly><ORD0785414>
<79872><5.8><0.5><MG><EN><SQ><TFK World Report 04-01-2005><Time for Kids Editors,><ORD1915643>
<79873><6.0><0.5><MG><EN><SQ><TFK World Report 04-08-2005><Time for Kids Editors,><ORD1915643>
