I’m using pdftotext to convert Spanish language text. Characters with accents or tildes are


Asked: May 17, 2026 (2026-05-17T00:48:39+00:00)

I’m using pdftotext to convert Spanish language text. Characters with accents or tildes are output in a systematic way that requires further conversion. Accents and tildes appear in the converted text in the correct position but without the letter. The letter almost always appears at the end of the output line. When it doesn’t, I can fix those by hand.

For example, the pdf sentence

¿Por qué?

becomes

¿Por qu´? e

I know enough about sed, awk and grep to think it can be done with some combination of those – and that it would take me a long time. I intend to use this to process all the pdf files in a folder.

The sentences appear in Spanish-English pairs on separate lines. I’d like to concatenate each pair with a semicolon delimiter, which is the import format of my flash card app (Anki), and delete all content that is not a Spanish-English sentence pair.

For example, convert this output

B:

¿Por qu´? e
Why?

into

¿Por qué?;Why?

Where there are multiple accents, tildes or a mix of both, the letters trailing the line are in the correct order and may be separated by commas. For example, the pdf sentence

Sí pero vi en la televisión que iba a llover.

becomes

S´ pero vi en la televisi´n que iba a llover. ı, o

or

S´ pero vi en la televisi´n que iba a llover. ı o

Output File Format

Each sentence always ends with punctuation: “!”, “?” or “.”. For those unfamiliar with Spanish, vowels (aeiou) are the only letters which may have an accent, the letter “n” is the only one that may have a tilde, and both marks may be found on upper and lower case letters.

The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of “A:”

I’m not interested in the line “Key Vocabulary” or anything that appears on any subsequent lines.

I ran pdftotext with UTF-8 encoding. My OS is Linux Mint 9, which is based on Ubuntu 10.04.

Below are two sample output files.

Output 1

Elementary - Credit Card A:

(B0089)

Me da la cuenta, por favor.
Bring me the check, please.

B:

Se la doy enseguida.
I’ll bring it to you right away.

B:

Perd´n se˜or, pero no aceptamos tarjeta. o n
Sorry sir, but we don’t take cards.

A:

¿No aceptan ninguna tarjeta de cr´dito? e
You don’t take any credit cards?


Key Vocabulary

tarjeta cr´dito e cuenta

Noun Noun Noun

card credit bill

Output 2

Elementary - My computer is not working A: ¡No puede ser!
It can’t be!

(B0079)

B:

¿Qu´ pasa? e
What happened?

A:

Mi computadora no est´ funcionando. a
My computer is not working.

B:

Rein´ ıciala.
Restart it.


Key Vocabulary

funcionar

Verb

to work


1 Answer

Editorial Team · Answer 1 · 2026-05-17T00:48:40+00:00

Edit: Minor change to the NR == 1 line to accommodate variations in the first line of the input file. This depends on “A:” appearing only once in the first line.

I also should add that this program depends on features of GNU AWK (gawk).

~~There seem to be some inconsistencies between your two output examples. The program below works with the first one. In the second example, this line contains both header and a data line:~~

Elementary - My computer is not working A: ¡No puede ser!

and this line contains the character to be substituted within the line rather than after the final punctuation.

Rein´ ıciala.

These issues could be accommodated by modifying the program if needed.
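
As a rough sketch of one way to handle the second issue (not a tested fix for every file): the dotless ı does not occur in ordinary Spanish spelling, so an accent mark followed by a space and ı can be joined before the main program runs. The function name `join_inline_i` is made up for illustration. Other letters are deliberately left alone, because a sequence like `´ a` can also appear when the displaced letter really is at the end of the line and the next word happens to start with "a".

```shell
# Hypothetical helper: join an accent mark with a dotless i that pdftotext
# left in the middle of the line ("Rein´ ıciala." -> "Reiníciala.").
# Reads the named files, or standard input when none are given.
join_inline_i() {
    awk '{ gsub(/´ ı/, "í"); print }' "$@"
}
```

Its output can then be piped into the program, which reads standard input when no input file is named. An uppercase "I" is not handled here, since `´ I` is not unambiguous in the same way.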

Also, you mention that these characters will be separated by commas, but the examples don’t have them (in the one place where it might have appeared). It doesn’t matter because my program ignores commas.
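
To illustrate why commas are harmless, here is a small standalone sketch (separate from the program itself) that treats everything after the end punctuation as a list of letters separated by spaces or commas. The function name `restore` and the reduced, lowercase-only lookup tables are assumptions made for the example, and it assumes the first ".", "?" or "!" on the line is the sentence's end punctuation.

```shell
# Hypothetical helper: put displaced letters back, ignoring commas and spaces.
restore() {
    awk '
    BEGIN {
        # mark[x]: the mark left in the sentence; full[x]: the repaired letter
        mark["a"] = "´"; full["a"] = "á"
        mark["e"] = "´"; full["e"] = "é"
        mark["ı"] = "´"; full["ı"] = "í"
        mark["o"] = "´"; full["o"] = "ó"
        mark["u"] = "´"; full["u"] = "ú"
        mark["n"] = "˜"; full["n"] = "ñ"
    }
    {
        # Lines without end punctuation are passed through unchanged.
        if (!match($0, /[.?!]/)) { print; next }
        body = substr($0, 1, RSTART)      # sentence including its punctuation
        tail = substr($0, RSTART + 1)     # displaced letters, if any
        n = split(tail, letter, /[ ,]+/)  # spaces and commas only separate
        for (i = 1; i <= n; i++)
            if (letter[i] in full)
                # replace the first remaining mark of this kind
                sub(mark[letter[i]], full[letter[i]], body)
        print body
    }' "$@"
}
```

For example, `printf 'S´ pero vi en la televisi´n que iba a llover. ı, o\n' | restore` should print the repaired sentence with or without the comma in the input.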

You can run the following program like this:

$ ./scriptname inputfile

Here it is in all its kludginess:

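#!/usr/bin/awk -f
# Requires GNU awk (gensub). If awk is not gawk on your system, run:
#   gawk -f scriptname inputfile
BEGIN {
    # Splitting on end punctuation puts the sentence in $1 and the
    # displaced letters in $2.
    FS = "[.?!]"
    # chars[letter] = the mark left in the sentence + the repaired letter
    chars["n"] = "˜ñ"
    chars["N"] = "˜Ñ"
    chars["a"] = "´á"
    chars["A"] = "´Á"
    chars["e"] = "´é"
    chars["E"] = "´É"
    chars["ı"] = "´í"
    chars["I"] = "´Í"
    chars["o"] = "´ó"
    chars["O"] = "´Ó"
    chars["u"] = "´ú"
    chars["U"] = "´Ú"
}

# Everything from "Key Vocabulary" onward is unwanted.
/Key Vocabulary/ { exit }

# Drop the level and title, up to and including "A:", on the first line.
# Editing $0 rather than $1 keeps the end punctuation if a sentence follows.
NR == 1 { sub(/.*A: */, "") }

# Skip lesson codes such as (B0089), speaker labels, and blank lines.
/^\(.*\) *$/ || /^(A|B): *$/ || /^ *$/ { next }

{
    # Recover the punctuation that FS consumed between $1 and $2.
    punct = gensub($1 "(.)" $2, "\\1", 1, $0)

    # For each displaced letter, in order, replace the first remaining
    # occurrence of its mark in $1. Spaces and commas are not in the
    # table, so they are skipped.
    for (i = 1; i <= length($2); i++) {
        char = substr($2, i, 1)
        if (char in chars) {
            sub(substr(chars[char], 1, 1), substr(chars[char], 2, 1), $1)
        }
    }

    printf "%s%s;", $1, punct
    getline     # the English translation is on the next line
    print
}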
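
Since the goal is to process every PDF in a folder, a wrapper along these lines might do. It is a sketch under assumptions, not part of the program above: it assumes the program is saved as an executable `./scriptname` in the current directory, and it relies on `pdftotext` accepting `-enc UTF-8` and writing to standard output when the output file is `-`. The function name `convert_folder`, the `.anki.txt` suffix, and the `PDFTOTEXT` and `FIXER` override variables are invented for the example.

```shell
# Hypothetical wrapper: convert every PDF in a directory and write one
# semicolon-delimited import file per PDF next to it.
convert_folder() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # no PDFs: the glob stays unexpanded
        out="${pdf%.pdf}.anki.txt"
        # "-" sends the extracted text to standard output
        "${PDFTOTEXT:-pdftotext}" -enc UTF-8 "$pdf" - |
            "${FIXER:-./scriptname}" > "$out"
    done
}
```

Calling `convert_folder .` would process the current directory. Files where the displaced letter is not at the end of the line would still need fixing by hand, as described in the question.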
