The Archive Base

Editorial Team
Asked: May 24, 2026

I used pdf2text from PDFMiner to reduce a PDF to text. Unfortunately, the output contains special characters. Here is the output from my console:

>>> a = pdf_to_text("ap.pdf")

Here's a sample of it, a little truncated:

>>> a[5000:5500]
'f one architect. Decades ...... but to re\xef\xac\x82ect\none set of design ideas, than to have one that contains many\ngood but independent and uncoordinated ideas.\n1 Joshua Bloch, \xe2\x80\x9cHow to Design a Good API and Why It Matters\xe2\x80\x9d, G......=-3733'

I understood that I must encode it:

>>> a[5000:5500].encode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 237: ordinal not in range(128)

I searched around and tried several suggestions, notably "Replace special characters in python". The input comes from PDFMiner, so it's tough (AFAIK) to control. What is the way to make proper plaintext from this output?

What am I doing wrong?

Update: a quick fix is to change PDFMiner's codec to ascii, but it's not a lasting solution.

Update: I abandoned the quick fix in favor of the answer, because changing the codec removes information.

A relevant topic, as mentioned by Maxim: http://en.wikipedia.org/wiki/Windows-1251


1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 7:35 am

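    This problem often occurs when non-ASCII text is stored in str objects. You are trying to encode as UTF-8 a string that is already encoded in some encoding (it contains bytes with values above 0x7f). In Python 2, calling .encode() on a str first decodes it implicitly with the ascii codec, which is why the traceback reports a UnicodeDecodeError rather than an encoding error.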
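To see the failure mode outside PDFMiner, here is a minimal sketch in Python 3 syntax (an assumption; the session in the question is Python 2). It performs explicitly the ASCII decode that Python 2 attempts implicitly when .encode() is called on a byte string. The bytes in `sample` are copied from the question's output, and the helper name `ascii_decode_error` is hypothetical.

```python
# Bytes copied from the question's output ("re\xef\xac\x82ect").
sample = b"re\xef\xac\x82ect"

def ascii_decode_error(data):
    """Return the error raised when decoding data as ASCII, or None if it decodes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None

# The first byte above 0x7f (0xef, at index 2) makes the ASCII decode fail.
err = ascii_decode_error(sample)
print(err)
```

The reported position is relative to the slice being decoded, so it differs from the position 237 in the question's traceback.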

    To encode such a string in UTF-8, it first has to be decoded. Assuming the original text encoding is cp1251 (replace it with your actual encoding), something like the following would do the trick:

    u = s.decode('cp1251')  # decode the cp1251 byte string (str) into a unicode string
    s = u.encode('utf-8')   # re-encode the unicode string as a UTF-8 byte string (str)
    

    Basically, the snippet above does what the iconv --from-code=CP1251 --to-code=UTF-8 command does: it converts the string from one encoding to another.
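The same decode-then-encode round trip can be sketched in Python 3 syntax (an assumption; the snippet in this answer is Python 2). The helper name `transcode` is hypothetical. One observation to check against your own data: the byte sequences shown in the question, \xef\xac\x82 and \xe2\x80\x9c, are valid UTF-8 for the "fl" ligature (U+FB02) and a left curly quote (U+201C), so that particular sample may already be UTF-8 rather than cp1251. The sketch treats it that way.

```python
import unicodedata

def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding, then re-encode them as target_encoding."""
    return data.decode(source_encoding).encode(target_encoding)

# cp1251 example: byte 0xcf is the Cyrillic capital letter Pe (U+041F) in cp1251.
cyrillic = transcode(b"\xcf", "cp1251")

# Sample copied from the question; assumed here to be UTF-8 already.
sample = b"re\xef\xac\x82ect \xe2\x80\x9cHow to Design a Good API\xe2\x80\x9d"
text = sample.decode("utf-8")

# Optional step: NFKC normalization expands compatibility characters such as
# the "fl" ligature into plain letters; the curly quotes are left unchanged.
plain = unicodedata.normalize("NFKC", text)
print(plain)
```

Whether to normalize is a design choice: NFKC makes the text easier to search and compare, but it discards the distinction between the ligature and the two separate letters.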

    Some useful links:

    • Python Unicode HOWTO
    • Developing Unicode-aware Applications in Python