I need a mechanism for extracting bibliographic metadata from PDF documents, to save people


Asked: May 16, 2026 (2026-05-16T12:17:44+00:00)


I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least, I need the title and abstract. The list of authors and their affiliations would be good. Extracting the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve its logical order, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
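To illustrate the column-interleaving problem, here is a minimal sketch of how reading order could be restored, assuming the extraction step can report a position for each text line. The `Fragment` type, the `reorder_two_columns` helper, and the coordinates are all hypothetical, and the sketch assumes a simple two-column page split at the midpoint, which real layouts will often violate.

```python
from collections import namedtuple

# Hypothetical fragment type: x and y are the top-left corner of a text line,
# with y increasing down the page. Not tied to any particular PDF library.
Fragment = namedtuple("Fragment", ["x", "y", "text"])


def reorder_two_columns(fragments, page_width):
    """Return text lines in reading order for a simple two-column page."""
    midpoint = page_width / 2
    # Assign each fragment to a column by which half of the page it starts in.
    left = [f for f in fragments if f.x < midpoint]
    right = [f for f in fragments if f.x >= midpoint]
    # Read the left column top to bottom, then the right column.
    ordered = sorted(left, key=lambda f: f.y) + sorted(right, key=lambda f: f.y)
    return [f.text for f in ordered]


# Fragments in the interleaved order a naive extraction might produce.
interleaved = [
    Fragment(50, 100, "col 1, line 1"),
    Fragment(320, 100, "col 2, line 1"),
    Fragment(50, 115, "col 1, line 2"),
    Fragment(320, 115, "col 2, line 2"),
]
reading_order = reorder_two_columns(interleaved, page_width=600)
```

Full-width elements such as a title or abstract spanning both columns would need separate handling before a split like this is applied.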

I know there are a lot of libraries. What I need to solve is identifying the abstract, title, authors, etc. in the document. This is never going to work every time, but even 80% would save a lot of human effort.
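As a rough idea of what a first-pass heuristic might look like once the text is in logical order, the sketch below guesses the title as the first non-empty line and the abstract as the text between an "Abstract" heading and the next "Keywords" or "Introduction" heading. The function name and the sample text are made up for illustration, and the assumptions about headings will fail on many real papers.

```python
import re

# Matches an "Abstract" heading at the start of a line and lazily captures
# everything up to a "Keywords" or "Introduction" heading, or end of text.
ABSTRACT_RE = re.compile(
    r"^\s*abstract\b[\s:.\-]*(.*?)"
    r"(?=^\s*(?:keywords\b|(?:1\.?\s+)?introduction\b)|\Z)",
    flags=re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def guess_title_and_abstract(text):
    """Return (title, abstract) guesses; either may be None."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Assumption: the first non-empty line of the document is the title.
    title = next((ln for ln in lines if ln), None)
    match = ABSTRACT_RE.search(text)
    # Collapse line breaks and repeated spaces inside the abstract.
    abstract = " ".join(match.group(1).split()) if match else None
    return title, abstract


# Made-up sample text, standing in for correctly ordered extractor output.
sample = """A Study of Widgets
Jane Doe, Example University

Abstract
We study widgets. They are
useful in many settings.

1. Introduction
Widgets are everywhere.
"""
title, abstract = guess_title_and_abstract(sample)
```

Authors, affiliations, and references are harder, since they usually depend on layout cues such as font size and position rather than on headings alone.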


1 Answer

Editorial Team · Answer 1 · 2026-05-16T12:17:45+00:00


We ran a contest to solve this problem at Dev8D in London in February 2010, and a nice little GPL tool was created as a result. We have not yet integrated it into our systems, but it is out there in the world.

https://code.google.com/p/pdfssa4met/
