The Archive Base

Editorial Team
Asked: May 17, 2026


I’m using pdftotext to convert Spanish language text. Characters with accents or tildes are output in a systematic way that requires further conversion. Accents and tildes appear in the converted text in the correct position but without the letter. The letter almost always appears at the end of the output line. When it doesn’t, I can fix those by hand.

For example, the pdf sentence

¿Por qué?

becomes

¿Por qu´? e

I know enough about sed, awk and grep to think it can be done with some combination of those – and that it would take me a long time. I intend to use this to process all the pdf files in a folder.

The sentences appear in Spanish-English pairs on separate lines. I’d like to concatenate the two with a semicolon delimiter, which is the import format of my flash card app (Anki). All content that is not a Spanish-English sentence pair should be deleted.

For example, convert this output

B:

¿Por qu´? e
Why?

into

¿Por qué?;Why?

Where there are multiple accents, tildes or a mix of both, the letters trailing the line are in the correct order and may be separated by commas. For example, the pdf sentence

Sí pero vi en la televisión que iba a llover.

becomes

S´ pero vi en la televisi´n que iba a llover. ı, o

or
S´ pero vi en la televisi´n que iba a llover. ı o
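
The rule described above (each orphaned accent or tilde takes the next trailing letter, in order, with commas ignored) can be sketched in Python. This is only an illustration under stated assumptions: the function name `repair_line` is hypothetical, the marks are assumed to be U+00B4 and U+02DC as they appear in the samples, and the trailing letters are assumed to follow the last end punctuation on the line.

```python
import re

# Composed letters for each orphaned mark. pdftotext emits a dotless
# "ı" (U+0131) for an accented i, so it is accepted alongside "i".
ACUTE = dict(zip("aeiouAEIOU", "áéíóúÁÉÍÓÚ"))
ACUTE["\u0131"] = "í"
TILDE = {"n": "ñ", "N": "Ñ"}
MARKERS = {"\u00b4": ACUTE, "\u02dc": TILDE}  # acute accent, small tilde


def repair_line(line):
    """Move trailing letters back onto the orphaned marks (hypothetical helper)."""
    # Split after the LAST end punctuation: sentence, then trailing letters.
    match = re.match(r"^(.*[.?!])\s*(.*)$", line)
    if not match:
        return line
    sentence, trailer = match.groups()
    if not any(mark in sentence for mark in MARKERS):
        return line  # nothing to repair, keep the line untouched
    letters = [c for c in trailer if c not in " ,"]
    out = []
    for ch in sentence:
        if ch in MARKERS and letters:
            base = letters.pop(0)
            # If the letter does not fit the mark, keep both as they are.
            out.append(MARKERS[ch].get(base, ch + base))
        else:
            out.append(ch)
    return "".join(out)


print(repair_line("S\u00b4 pero vi en la televisi\u00b4n que iba a llover. \u0131, o"))
```

Lines whose letter is not at the end of the line are left with the mark in place, so they can still be fixed by hand.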

Output File Format

The sentences always have an end punctuation, either “!”, “?” or “.”. For those unfamiliar with Spanish, vowels (aeiou) are the only letters which may have an accent, the letter “n” is the only one that may have a tilde, and the 2 special characters may be found on both upper and lower case letters.

The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of “A:”

I’m not interested in the line “Key Vocabulary” or anything that appears on any subsequent lines.

pdftotext was run with UTF-8 encoding. My OS is Linux Mint 9, which is based on Ubuntu 10.04.

Below are two sample output files.

Output 1

Elementary - Credit Card A:

(B0089)

Me da la cuenta, por favor.
Bring me the check, please.

B:

Se la doy enseguida.
I’ll bring it to you right away.

B:

Perd´n se˜or, pero no aceptamos tarjeta. o n
Sorry sir, but we don’t take cards.

A:

¿No aceptan ninguna tarjeta de cr´dito? e
You don’t take any credit cards?


Key Vocabulary

tarjeta cr´dito e cuenta

Noun Noun Noun

card credit bill

Output 2

Elementary - My computer is not working A: ¡No puede ser!
It can’t be!

(B0079)

B:

¿Qu´ pasa? e
What happened?

A:

Mi computadora no est´ funcionando. a
My computer is not working.

B:

Rein´ ıciala.
Restart it.


Key Vocabulary

funcionar

Verb

to work
1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 12:48 am

    Edit: Minor change to the NR == 1 line to accommodate variations in the first line of the input file. This depends on “A:” appearing only once in the first line.

    I also should add that this program depends on features of GNU AWK (gawk).

    There seem to be some inconsistencies between your two output examples. The program below works with the first one. In the second example, this line contains both header and a data line:

    Elementary - My computer is not working A: ¡No puede ser!

    and this line contains the character to be substituted within the line rather than after the final punctuation.

    Rein´ ıciala.

    These issues could be accommodated by modifying the program if needed.
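
As a rough illustration of what such a modification might look like, here is a Python sketch rather than a change to the awk program. The names `strip_header` and `repair_inline` are hypothetical. The inline repair is deliberately limited to lines that end in sentence punctuation, on the assumption that a line carrying trailing letters must not be touched this way (otherwise "S´ en ..." would wrongly become "Sén ...").

```python
import re

# Composed letters for each orphaned mark; the dotless "ı" (U+0131)
# is what pdftotext emits for an accented i.
ACCENTED = {
    "\u00b4": dict(zip("aeiou\u0131AEIOU", "áéíóúíÁÉÍÓÚ")),  # acute accent
    "\u02dc": {"n": "ñ", "N": "Ñ"},                           # small tilde
}


def strip_header(first_line):
    """Drop the level and title, i.e. everything up to the first 'A:'."""
    _, sep, rest = first_line.partition("A:")
    return rest.strip() if sep else first_line.strip()


def repair_inline(line):
    """Join a mark with the letter that directly follows it."""
    if not line.rstrip().endswith((".", "?", "!")):
        return line  # trailing letters present: not an inline case

    def join(match):
        mark, letter = match.groups()
        # Leave the text alone when the letter cannot carry this mark.
        return ACCENTED[mark].get(letter, match.group(0))

    return re.sub("([\u00b4\u02dc]) ?(\\w)", join, line)


print(strip_header("Elementary - My computer is not working A: \u00a1No puede ser!"))
print(repair_inline("Rein\u00b4 \u0131ciala."))
```

Whether the header line also contains a sentence is handled by the same call: an empty result means the first line held only the level and title.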

    Also, you mention that these characters will be separated by commas, but the examples don’t have them (in the one place where it might have appeared). It doesn’t matter because my program ignores commas.

    You can run the following program like this:

    $ ./scriptname inputfile
    

    Here it is in all its kludginess:

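    #!/usr/bin/gawk -f
    # Requires GNU awk (gensub) and a UTF-8 locale so substr() counts characters.
    BEGIN {
        FS = "[.?!]"    # $1 = sentence, $2 = trailing letters
        # chars[letter] = orphaned mark followed by the composed letter
        chars["n"] = "˜ñ"
        chars["N"] = "˜Ñ"
        chars["a"] = "´á"
        chars["A"] = "´Á"
        chars["e"] = "´é"
        chars["E"] = "´É"
        chars["ı"] = "´í"
        chars["I"] = "´Í"
        chars["o"] = "´ó"
        chars["O"] = "´Ó"
        chars["u"] = "´ú"
        chars["U"] = "´Ú"
    }

    # Nothing from this heading onwards is wanted.
    /Key Vocabulary/ {exit}

    # Drop the level and title, which precede "A:" on the first line.
    NR == 1 { sub(".*A: *","",$1) }

    # Skip "(B0089)"-style codes, speaker labels and blank lines.
    /^\(.*\) *$/ || \
    /^(A|B): *$/ || \
    /^ *$/ \
        {next}

    {
        # Recover the end punctuation that FS swallowed.
        punct = gensub($1"(.)"$2,"\\1",1,$0)

        # Each trailing letter replaces the first remaining mark of its kind.
        for (i=1; i<=length($2); i++) {
            char = substr($2,i,1)
            if (char in chars) {
                sub(substr(chars[char],1,1),substr(chars[char],2,1),$1)
            }
        }

        printf "%s%s;", $1, punct
        getline    # the English translation is on the next line
        print
    }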
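
For readers without gawk, the filtering and pairing steps of the script above can also be sketched in Python. This is a minimal sketch, not a port: the name `pair_lines` is hypothetical, it assumes the accents have already been repaired, and it assumes every kept Spanish line is immediately followed by its English translation.

```python
import re

# Lines to drop: blank lines, speaker labels ("A:", "B:") and
# parenthesised lesson codes such as "(B0089)".
SKIP = re.compile(r"^\s*(?:[AB]:|\(.*\))?\s*$")


def pair_lines(lines):
    """Return 'spanish;english' strings, one per sentence pair."""
    kept = []
    for n, line in enumerate(lines):
        if "Key Vocabulary" in line:
            break  # nothing from this heading onwards is wanted
        if n == 0 and "A:" in line:
            # The level and title precede the first "A:" on the first line.
            line = line.partition("A:")[2]
        if SKIP.match(line):
            continue
        kept.append(line.strip())
    # Sentences alternate Spanish, English, Spanish, English, ...
    return [f"{es};{en}" for es, en in zip(kept[0::2], kept[1::2])]


sample = [
    "Elementary - Credit Card A:", "", "(B0089)", "",
    "Me da la cuenta, por favor.", "Bring me the check, please.", "",
    "B:", "", "Se la doy enseguida.", "I'll bring it to you right away.",
    "", "Key Vocabulary", "", "tarjeta cuenta",
]
for card in pair_lines(sample):
    print(card)
```

Unlike the awk version, this keeps a sentence that shares the first line with the title, because the header is cut at the first "A:" before the line is examined.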
    