Editorial Team
Asked: June 7, 2026

I’m trying to extract text from a large number of PDFs using PDFMiner python

I’m trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them:

ipython stack trace:

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.

Is there any way to read these PDFs despite the absence of a root object? I’m not too sure where to go from here.

Many thanks!

Edit:

I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF marker. The pyPdf error above would seem to suggest otherwise, but I’m very much clueless. Any thoughts?


1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 7:02 pm

    Interesting problem. I did some research.

    This is the function that raises the error (from PDFMiner's source code):

    def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return
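
The `for ... else` in `set_parser` is easy to misread, so here is a minimal standalone sketch of the same control flow. It is not PDFMiner code: `find_catalog` and `NoRootError` are hypothetical names, and plain dicts stand in for the trailers. It is written for Python 3. The point it illustrates is that the `else` branch runs only when the loop ends without hitting `break`, that is, when the list is empty or no trailer contains a `'Root'` key.

```python
class NoRootError(Exception):
    """Hypothetical stand-in for PDFMiner's PDFSyntaxError."""


def find_catalog(trailers):
    """Return the first 'Root' entry found in a list of trailer dicts.

    `trailers` is a plain list of dicts (an assumption for this sketch);
    the real code obtains each trailer from an xref section.
    """
    for trailer in trailers:
        if not trailer:
            continue  # skip xref sections that carry no trailer
        if 'Root' in trailer:
            catalog = trailer['Root']
            break  # leaving via break skips the else clause below
    else:
        # Reached only when the loop finished without a break:
        # the list was empty, or no trailer had a 'Root' key.
        raise NoRootError('No /Root object! - Is this really a PDF?')
    return catalog
```

Under these assumptions, `find_catalog([{}, {'Root': 'catalog'}])` returns `'catalog'`, while `find_catalog([])` and `find_catalog([{'Info': 1}])` both raise the error.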
    

    If the problem were a missing EOF, a different exception would be raised. Here is another function from the source:

    def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return
    

    From Wikipedia (on the PDF specification):
    A PDF file consists primarily of objects, of which there are eight types:

    Boolean values, representing true or false
    Numbers
    Strings
    Names
    Arrays, ordered collections of objects
    Dictionaries, collections of objects indexed by Names
    Streams, usually containing large amounts of data
    The null object
    

    Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
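
To make the xref table described above concrete, here is a small sketch that parses a classic xref section (the plain-text kind, not a cross-reference stream) into a mapping from object number to generation number and byte offset. It mirrors the logic of the `load` function quoted earlier, but it is a Python 3 illustration only: `parse_xref_section` and `SAMPLE` are made-up names, and the sample table is invented for the example.

```python
def parse_xref_section(text):
    """Parse a classic xref section into {objid: (genno, offset)}.

    `text` is the part of the file after the `xref` keyword, up to and
    optionally including the `trailer` keyword.
    """
    offsets = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        if lines[i].startswith('trailer'):
            break  # end of the xref entries
        fields = lines[i].split()
        if len(fields) != 2:
            raise ValueError('expected "start count", got %r' % lines[i])
        start, count = int(fields[0]), int(fields[1])
        i += 1
        for objid in range(start, start + count):
            if i >= len(lines):
                raise ValueError('xref section ended early')
            entry = lines[i].split()
            i += 1
            if len(entry) != 3:
                raise ValueError('invalid xref entry: %r' % lines[i - 1])
            pos, genno, use = entry
            if use != 'n':
                continue  # 'f' marks a free (unused) entry
            offsets[objid] = (int(genno), int(pos))
    return offsets


# Invented sample: one subsection starting at object 0 with 3 entries.
SAMPLE = """\
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
trailer
"""

print(parse_xref_section(SAMPLE))  # {1: (0, 17), 2: (0, 81)}
```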

    I think the problem is that your “damaged” PDFs contain more than one ‘root element’.

    Possible solution:

    Download the PDFMiner sources and add print statements wherever the xref objects are retrieved and wherever the parser tries to parse them. That should show what the parser actually read before the error was raised.
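
Before patching the library, it may help to look at the raw bytes of a failing file. The sketch below is a rough diagnostic, not part of PDFMiner or pyPdf, and `scan_pdf_markers` is a hypothetical helper name. It is written for Python 3 and uses only the standard library. It reports where the `%PDF-` header, `%%EOF`, `startxref`, `trailer` and `/Root` markers occur, plus how many non-whitespace bytes follow the last `%%EOF`. Junk before the header or after the final `%%EOF` is one possible explanation for a file that viewers open but stricter parsers reject. Note that a plain byte search can also match inside stream data, so treat the offsets as hints only.

```python
import re


def scan_pdf_markers(data):
    """Report where key PDF markers occur in the raw bytes of a file."""

    def positions(pattern):
        return [m.start() for m in re.finditer(pattern, data)]

    eof = positions(rb'%%EOF')
    report = {
        # -1 means no header at all; > 0 means bytes precede the header.
        'header_offset': data.find(b'%PDF-'),
        'eof_offsets': eof,
        'startxref_offsets': positions(rb'startxref'),
        'trailer_offsets': positions(rb'trailer'),
        'root_offsets': positions(rb'/Root\b'),
    }
    if eof:
        tail = data[eof[-1] + len(b'%%EOF'):]
        # Count only non-whitespace bytes after the last %%EOF marker.
        report['bytes_after_last_eof'] = len(tail.strip())
    else:
        report['bytes_after_last_eof'] = None
    return report


# Usage (the path is a placeholder):
# with open('failing.pdf', 'rb') as fh:
#     print(scan_pdf_markers(fh.read()))
```

Comparing the report for a file that parses with one that fails should show quickly whether the trailer, the `/Root` entry, or the end-of-file marker is what differs.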

    P.S. I think this is some kind of bug in PDFMiner.
