I get a file via an HTTP upload and need to make sure it's a PDF file. The programming language is Python, but this should not matter.
I thought of the following solutions:
1. Check if the first bytes of the string are %PDF. This is not a good check, but it prevents the user from uploading other files accidentally.
2. Use libmagic (the file command in bash uses it). This does exactly the same check as in (1).
3. Use a library to try to read the page count out of the file. If the library is able to read a page count, it should be a valid PDF file. Problem: I don't know of a Python library that can do this.
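For reference, the first option can be sketched in a few lines. This is a minimal illustration, not a real validator: the function name `looks_like_pdf` is made up for the example, and it assumes the upload is already available as a `bytes` object. It checks for `%PDF-`, the signature that well-formed files start with (followed by a version number).

```python
PDF_MAGIC = b"%PDF-"  # signature at the start of a well-formed PDF


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: does the payload start with the PDF signature?

    This only catches accidental uploads of other file types; it says
    nothing about whether the rest of the file is a parseable PDF.
    """
    return data.startswith(PDF_MAGIC)
```

Anything that passes this check still needs a real parse if the file must actually be readable.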
Are there solutions using a library or another trick?
The two most commonly used PDF libraries for Python are:
Both are pure Python, so they should be easy to install as well as cross-platform.
With pypdf it would probably be as simple as doing:
This should be enough, but reader will now have the metadata and pages attributes if you want to do further checking.

As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process, etc.
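For comparison, here is a rough sketch of the pdftotext route. It is not Carl's code; it assumes the `pdftotext` command-line tool is installed and on `PATH`, and that the upload has been saved to disk first. The function name `pdftotext_accepts` and the 30-second timeout are arbitrary choices for the example.

```python
import shutil
import subprocess


def pdftotext_accepts(path: str, timeout: float = 30.0) -> bool:
    """Return True if the external pdftotext tool converts `path` without error."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    try:
        result = subprocess.run(
            ["pdftotext", path, "-"],   # "-" sends the extracted text to stdout
            stdout=subprocess.DEVNULL,  # the text itself is not needed here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung conversion as a failed check.
        return False
    # pdftotext exits with 0 on success and a non-zero code on errors.
    return result.returncode == 0
```

Each call spawns a new process, which is the per-file overhead mentioned above.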