The Archive Base

Editorial Team
Asked: June 1, 2026

I have a series of ex-PDF documents (scientific/technical) with characters encoded as vector graphics

I have a series of ex-PDF documents (scientific/technical) in which the characters are encoded as vector graphics rather than as text in a font family. How can I convert the vector stream back to characters using open-source solutions?

I would welcome any accounts of successful solutions. These might include:

  • machine learning to discover the original font family
  • writing the stream to a canvas and using OCR
  • heuristics based on reconstructing the characters from the strokes

The characters are probably fairly “simple” (many are sans-serif), and I’d be happy with reconstruction into ASCII (chars 32-127).

UPDATE: [for SO readers’ info; does not affect bounty].
I have been extracting the vectors from a single example. Each glyph consists of a stroke outlining its shape, so even simple glyphs such as “I” are “hollow”. I suspect this is true of most vector fonts. I have verified that multiple instances of the same character have identical internal coordinates, which could be used for lookup and for discriminating between fonts (the minuscule differences will show up in the decimal places). If the fonts scale precisely, and if we have the coordinates of the fonts (copyright allowing), then looking up their internal coordinates is a powerful approach. I’d be interested to hear whether anyone has tried this.
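A minimal sketch of the coordinate-lookup idea, assuming each glyph arrives as an ordered list of (x, y) outline points and that every instance of a character starts at the same point and runs in the same direction. The names `glyph_key`, `build_table`, `lookup` and the `REFERENCE` outlines are hypothetical, made up for illustration; real reference outlines would have to come from the font files themselves.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Outline = List[Point]
Key = Tuple[Point, ...]


def glyph_key(outline: Outline, digits: int = 3) -> Key:
    """Return a position- and scale-independent key for one glyph outline."""
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    min_x, min_y = min(xs), min(ys)
    # Scale by bounding-box height; fall back to 1.0 for a degenerate flat outline.
    height = (max(ys) - min_y) or 1.0
    return tuple(
        (round((x - min_x) / height, digits), round((y - min_y) / height, digits))
        for x, y in outline
    )


def build_table(reference: Dict[str, Outline]) -> Dict[Key, str]:
    """Map each normalized reference outline to its character."""
    return {glyph_key(outline): char for char, outline in reference.items()}


def lookup(outline: Outline, table: Dict[Key, str]) -> Optional[str]:
    """Return the matching character, or None if the outline is unknown."""
    return table.get(glyph_key(outline))


# Hypothetical reference outlines (simple block shapes, not taken from a real font).
REFERENCE: Dict[str, Outline] = {
    "I": [(0.0, 0.0), (1.0, 0.0), (1.0, 5.0), (0.0, 5.0)],
    "L": [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (1.0, 1.0), (1.0, 5.0), (0.0, 5.0)],
}
TABLE = build_table(REFERENCE)

# The same "I" outline, moved elsewhere on the page and drawn at twice the size.
print(lookup([(10.0, 20.0), (12.0, 20.0), (12.0, 30.0), (10.0, 30.0)], TABLE))  # I
```

The `digits` parameter controls how strict the match is: more decimal places separate near-identical fonts, fewer tolerate rounding noise from the extraction step.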

1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 2:15 pm

    Your question already lists the best-known approaches to converting vector-encoded glyphs into characters when the formatting and font families are unknown. What you lack, and what you are asking for, is a solution that re-encodes the stream at an arbitrary (but desirably high) level of quality.

    Let’s take each of your candidate approaches in turn:

    1. machine learning to discover the original font family

      This paper discusses the topic in more detail. The most common techniques (reference) are to train a simple support vector machine or to use Bayesian inference to classify each character.

      These techniques are most often found in spam detection, where the complete body of an email is inspected visually for things such as ASCII art or spam encoded as image content. They are used much less for classifying vector glyphs when reading documents, at least beyond an initial pass.

    2. writing the stream to a canvas and using OCR

      This is the technique with the most software support, because the typical use case is a scanned physical document passed in for visual inspection. It does not preserve the vector paths for classification; it relies instead on recognizing the rendered glyphs on the page.

      Several free solutions exist here, including OCR 4 Linux and the now-free tesseract-ocr. For a more complete list, including feature comparisons, see here.

    3. heuristics based on reconstructing the characters from the strokes

      For the most part, these heuristics are derived from machine learning techniques and are built into OCR or handwriting recognition software. Because recognizing characters in an arbitrary stream is an inductive classification problem, they are usually limited to the specific language that backs the heuristic.

      This technique certainly exists. It’s currently in use by tools like Evernote, which allows you to upload your documents for free (up to a point) and performs the vector analysis for you.
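    As a rough illustration of approach (3), here is a sketch of one very simple stroke heuristic: count the contours and enclosed "holes" of a glyph and use that to narrow the candidate letters. It assumes each glyph is a list of closed contours and that holes wind in the opposite direction to the largest contour, which is a common font convention but not guaranteed for every file. The function names and the `BY_HOLES` grouping are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Contour = List[Point]


def signed_area(contour: Contour) -> float:
    """Shoelace formula: positive for counter-clockwise contours, negative for clockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(contour, contour[1:] + contour[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def glyph_features(contours: List[Contour]) -> Dict[str, float]:
    """Return contour count, hole count and width/height ratio for one glyph."""
    areas = [signed_area(c) for c in contours]
    outer = max(areas, key=abs)  # the largest contour is taken as the outer shape
    holes = sum(1 for a in areas if a * outer < 0)  # opposite winding => hole
    xs = [x for c in contours for x, _ in c]
    ys = [y for c in contours for _, y in c]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "contours": len(contours),
        "holes": holes,
        "aspect": width / height if height else 0.0,
    }


# Illustrative grouping of uppercase Latin letters by number of enclosed
# counters in a typical sans-serif face (an assumption, not font-derived).
BY_HOLES = {0: "CEFGHIJKLMNSTUVWXYZ", 1: "ADOPQR", 2: "B"}


def candidates(contours: List[Contour]) -> str:
    """Narrow the possible letters using the hole count alone."""
    return BY_HOLES.get(int(glyph_features(contours)["holes"]), "")


# A blocky "O": counter-clockwise outer square with a clockwise inner square.
LETTER_O = [
    [(0.0, 0.0), (4.0, 0.0), (4.0, 6.0), (0.0, 6.0)],
    [(1.0, 1.0), (1.0, 5.0), (3.0, 5.0), (3.0, 1.0)],
]
print(glyph_features(LETTER_O)["holes"], candidates(LETTER_O))  # 1 ADOPQR
```

    A real heuristic would need more features (stroke direction, symmetry, position of the holes) to separate the letters within each group; the hole count only reduces the search space.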

    Because the first approach is time-consuming, and because you have a known language and a likely known set of font families, I recommend pursuing (2) and (3) as your first ports of call. The easiest method would be to get a free Evernote account and upload the documents, purely to see what gets captured.
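    If you want to try approach (2) yourself, the first step is turning the outlines into a bitmap. The sketch below is one way to do that with the even-odd fill rule, which also fills "hollow" outlined glyphs correctly. It assumes the contours are already scaled to pixel units with the y axis pointing up, as in PDF; `rasterize` and `to_pbm` are names made up for this example. The output is a plain PBM (P1) image, which you would then pass to an OCR engine such as tesseract-ocr, converting to another image format first if needed.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Contour = List[Point]


def inside_even_odd(x: float, y: float, contours: List[Contour]) -> bool:
    """Even-odd rule: a point is inked if a ray to the right crosses an odd number of edges."""
    inside = False
    for contour in contours:
        for (x1, y1), (x2, y2) in zip(contour, contour[1:] + contour[:1]):
            if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
    return inside


def rasterize(contours: List[Contour], width: int, height: int) -> List[List[int]]:
    """Sample each pixel centre; row 0 is the top of the image, so y is flipped."""
    return [
        [
            int(inside_even_odd(col + 0.5, height - (row + 0.5), contours))
            for col in range(width)
        ]
        for row in range(height)
    ]


def to_pbm(bitmap: List[List[int]]) -> str:
    """Serialize as plain PBM (P1), where 1 means a black pixel."""
    rows = [" ".join(str(pixel) for pixel in row) for row in bitmap]
    return "P1\n{} {}\n{}\n".format(len(bitmap[0]), len(bitmap), "\n".join(rows))


# A blocky "O" as two contours: the outer square and the inner counter.
LETTER_O = [
    [(0.0, 0.0), (4.0, 0.0), (4.0, 6.0), (0.0, 6.0)],
    [(1.0, 1.0), (1.0, 5.0), (3.0, 5.0), (3.0, 1.0)],
]
BITMAP = rasterize(LETTER_O, 4, 6)
for row in BITMAP:
    print("".join("#" if pixel else "." for pixel in row))
```

    For real documents you would scale the glyph coordinates up first (OCR engines generally do better on larger glyphs) and render a whole line of text rather than a single character.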

    Best of luck to you. If the current state of the art is insufficient, you may have a useful corner case worth contributing to the field. 🙂

© 2021 The Archive Base. All Rights Reserved