I receive encoded PDF files regularly. The encoding works like this:
- the PDFs can be displayed correctly in Acrobat Reader
- select all and copy the text in Acrobat Reader
- paste it into a text editor
- the pasted content turns out to be encoded
For example:
13579 -> 3579;
hello -> jgnnq
It’s basically an offset (maybe a swap) of ASCII characters.
The question is: how can I find the offset automatically when I only have access to a few samples? I cannot be sure whether the offset changes from file to file. All I know is that some text will usually (if not always) show up inside the PDF, e.g. “Name:”, “Summary:”, “Total:”.
Thank you!
edit: thanks for the feedback. I’d try to break the question into smaller questions:
You need to brute-force it.
If the pattern is a simple shift, as in your examples (every character code is increased by 2),
you can try each candidate offset and check the result against the known words.
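Here is a rough sketch of that brute-force idea in Python. It is not tested against real PDFs, and it assumes a plain shift of character codes with no wrap-around and no character swaps. The names `shift`, `find_offset`, `KNOWN_WORDS` and the `max_offset` limit are made up for illustration. Instead of decoding the whole text for every offset, it encodes the known words with each candidate offset and looks for them in the pasted text.

```python
# Words the question says usually appear in the decoded PDF text.
KNOWN_WORDS = ["Name:", "Summary:", "Total:"]


def shift(text, offset):
    """Shift every character code by `offset` (assumption: no wrap-around)."""
    out = []
    for c in text:
        code = ord(c) + offset
        # Leave a character unchanged if the shift would leave the valid range.
        out.append(chr(code) if 0 <= code <= 0x10FFFF else c)
    return "".join(out)


def find_offset(encoded, known_words=KNOWN_WORDS, max_offset=30):
    """Return the offset whose shifted known words match `encoded` most often.

    Returns None if no candidate offset produces any match.
    """
    best_offset, best_hits = None, 0
    for offset in range(-max_offset, max_offset + 1):
        # Encode each known word with this offset and look for it in the text.
        hits = sum(shift(word, offset) in encoded for word in known_words)
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    return best_offset


if __name__ == "__main__":
    sample = shift("Name: Bob\nTotal: 13579", 2)  # simulate an encoded paste
    offset = find_offset(sample)
    print(offset)
    print(shift(sample, -offset))  # decode by shifting back
```

Since the offset is recomputed for every file, it does not matter if it changes between files. If `find_offset` returns `None`, the encoding is probably not a plain shift (or none of the known words are present), and you would need a different approach such as building a character-by-character mapping from the samples.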