I received a PDF file of tabular data that I’ve converted to plaintext for processing.
pdftotext -nopgbrk -layout file.pdf
This does a pretty decent job, but it uses spaces to separate the fields and seems primarily interested in preserving the visual layout rather than the ‘structural’ layout, i.e., there is no consistent or reliable delimiter. So now I convert runs of 2 or more whitespace characters to tabs:
sed -i 's/[[:space:]]\{2,\}/\t/g' file.txt
Using cat -vte I see that this does a pretty nice job of placing tabs in the file. However, there are a few inconsistencies with the second field that I’d like to ask your help with.
Please see the following comparison for clarification:
Normal/Expected results:
79879 5.6 0.5 MG EN SQ TFK World Report 09-24-2004 Time for Kids Editors, ORD1915643
79880 5.5 0.5 MG EN SQ TFK World Report 10-01-2004 Time for Kids Editors, ORD1915643
79881 6.0 0.5 MG EN SQ TFK World Report 10-08-2004 Time for Kids Editors, ORD1915643
79882 5.5 0.5 MG EN SQ TFK World Report 10-22-2004 Time for Kids Editors, ORD1915643
79883 5.9 0.5 MG EN SQ TFK World Report 10-29-2004 Time for Kids Editors, ORD1915643
Some oddities and inconsistencies:
72 5.2 3.0 MG EN LS Ramona and Her Father Cleary, Beverly ORD2111460
491 4.8 4.0 MG EN LS Ramona and Her Mother Cleary, Beverly ORD1748201
134 5.6 3.0 MG EN LS Ramona Quimby, Age 8 Cleary, Beverly ORD1748201
29 4.7 5.0 MG EN LS From the Mixed-Up Files of Mrs. Basil E. Konigsburg, E.L. ORD1525579
Note that the ‘smushing’ effect may occur in either field 2 or field 3, and that the number of fields differs from the ‘normal’ results by either 1 or 2.
To solve this I’ve tried things like the following:
awk -F'\t' 'OFS="\t";$1 ~ /^[[:digit:]]/{print $1,gensub(/[[:space:]]/,"\t","g",$2),$3,$4,$5,$6,$7}' file.txt
This seems to print each line (or at least most lines) twice, and it cuts off fields.
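A note on the doubling, as a likely explanation rather than a confirmed diagnosis: in that command OFS="\t" sits in pattern position, and an awk pattern with no action prints the record. The assignment evaluates to a non-empty string, so it is always true and every line is printed once unchanged, then printed again by the second rule whenever it starts with a digit. The cut-off fields are consistent with printing only $1 through $7 after $2 has been expanded into several columns. The sketch below reproduces the effect on two made-up tab-separated records, with the rule simplified to print three fields and [0-9] used in place of [[:digit:]] for portability.

```shell
# Two made-up tab-separated records, for illustration only.
sample=$(printf '1\ta b\tc\n2\td e\tf\n')

# OFS="\t" as a pattern: always true, so the default action prints every
# record, and the second rule then prints it a second time.
doubled=$(printf '%s\n' "$sample" |
    awk -F'\t' 'OFS="\t"; $1 ~ /^[0-9]/ {print $1, $2, $3}')

# With the assignment moved into BEGIN there is only one rule left,
# so each record is printed once.
single=$(printf '%s\n' "$sample" |
    awk -F'\t' 'BEGIN {OFS="\t"} $1 ~ /^[0-9]/ {print $1, $2, $3}')

printf '%s\n' "$doubled" | wc -l   # 4 lines
printf '%s\n' "$single" | wc -l    # 2 lines
```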
EDIT
This seems to be working …so far, still testing.
awk -F'\t' 'BEGIN { OFS = "\t" }
{ $2 = gensub(/[[:space:]]/, "\t", "g", $2)
  $3 = gensub(/[[:space:]]/, "\t", "g", $3)
  print }' file.txt
Is there a simple way to solve this issue using awk?
UPDATE
Some have requested a sample representing the state just prior to my space-to-tab conversion. The following sample comes from near where the previous sample is in the document. It looks about the same, except that this one [below] is space-separated and the other [above] is tabbed. Note the way pdftotext deals with column 2 in the different samples below: sometimes it splits the column, sometimes it makes a single column.
Sample 1:
72 5.2 3.0 MG EN RP Ramona and Her Father Cleary, Beverly ORD0630871
are orphans
491 4.8 4.0 MG EN RP Ramona and Her Mother Cleary, Beverly ORD0785414
are also orphans
186 4.8 4.0 MG EN RP Ramona Forever Cleary, Beverly ORD0630871
forever the orphan
Sample 2:
79871 5.7 0.5 MG EN SQ TFK World Report 03-18-2005 Time for Kids Editors, ORD1915643
79872 5.8 0.5 MG EN SQ TFK World Report 04-01-2005 Time for Kids Editors, ORD1915643
79873 6.0 0.5 MG EN SQ TFK World Report 04-08-2005 Time for Kids Editors, ORD1915643
UPDATE 2
Made the following changes to Ed’s submission. Thinking it could be simplified, but it works. It has to allow for the orphaned lines.
$1 ~ /^[[:digit:]]+/ {
    for (i = 1; i <= 6; i++)
        printf "%s\t", $i
    n = split($0, tmp, /  +/)    # split on runs of two or more spaces
    for (i = 2; i >= 0; i--)
        printf "%s\t", tmp[n-i]
    print ""
}
$1 ~ /^[^[:digit:]]+/ { print $0 }
Maybe this is prettier:
{
    if ($1 ~ /^[[:digit:]]+/) {
        for (i = 1; i <= 6; i++)
            printf "%s\t", $i
        n = split($0, tmp, /  +/)    # split on runs of two or more spaces
        for (i = 2; i >= 0; i--)
            printf "%s\t", tmp[n-i]
        print ""
    }
    else print $0
}
Rather than having us start from the output of a sed command that may be what is corrupting your data, post your data as it is BEFORE you run that sed command and let us go from there. Since you say the PDF conversion tool preserves the “visual layout”, I suspect the right solution is simply to use gawk’s FIELDWIDTHS on that output, so you parse it by the width of each field rather than trying to work out how many spaces represent a field separator.
EDIT: here’s a match()-based solution for comparison, but I actually now think @ghoti is right and the solution is simpler than this:
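A minimal sketch of the FIELDWIDTHS idea, for illustration only. The column widths below (6 4 4 3 3 3 28 24 10) are made up, since the real ones depend on how pdftotext laid out your file, and the function name fixed_to_tsv is hypothetical. It needs gawk, because FIELDWIDTHS is a gawk extension. Lines that do not start with a number (the orphaned continuation lines) are passed through untouched.

```shell
# fixed_to_tsv [WIDTHS]: read fixed-width text on stdin, write tab-separated.
# WIDTHS is a space-separated list of column widths (assumed values by default).
fixed_to_tsv() {
    gawk -v widths="${1:-6 4 4 3 3 3 28 24 10}" '
        BEGIN { FIELDWIDTHS = widths; OFS = "\t" }
        # Orphaned lines do not start with a number: print them as they are.
        !/^ *[0-9]/ { print; next }
        {
            for (i = 1; i <= NF; i++)
                gsub(/^ +| +$/, "", $i)   # trim the padding around each field
            $1 = $1                       # force $0 to be rebuilt with OFS
            print
        }'
}
```

It would be called as fixed_to_tsv < file.txt on the untouched pdftotext output, or with your own widths as the first argument. Whether this works at all depends on the columns really starting at the same offsets on every line, which the sample cannot confirm.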
EDIT: yeah, here’s a simpler solution, just print the first 6 fields and then split the rest on a multi-space separator:
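The original snippet is not reproduced here, so the following is only a sketch of that idea under stated assumptions: the first six fields never contain spaces, the remaining columns are separated by two or more spaces, and the input is the pdftotext output before any sed conversion. The function name split_columns is hypothetical.

```shell
# split_columns: read space-aligned text on stdin, write tab-separated.
split_columns() {
    awk '
        BEGIN { OFS = "\t" }
        /^ *[0-9]/ {
            # Fields 1-6 hold no embedded spaces, so default splitting is safe.
            out = $1
            for (i = 2; i <= 6; i++)
                out = out OFS $i
            # Strip those six fields, then split what is left on 2+ spaces.
            rest = $0
            for (i = 1; i <= 6; i++)
                sub(/^ *[^ ]+ +/, "", rest)
            sub(/ +$/, "", rest)
            n = split(rest, tmp, /  +/)
            for (j = 1; j <= n; j++)
                out = out OFS tmp[j]
            print out
            next
        }
        { print }    # orphaned lines pass through unchanged
    '
}
```

Run as split_columns < file.txt. A title and author separated by only a single space would still end up in one column, so the result is worth checking with cat -vte.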