The Archive Base

Editorial Team
Asked: June 17, 2026 (2026-06-17T05:41:07+00:00)

First of all, I am not aiming for a specific development answer, but rather

First of all, I am not aiming for a specific development answer, but rather a development approach.

The problem I am having is that I have a client with an enormous number of articles in PDFs: about 150 articles in fifty PDFs per year for the last 20 years. All of these PDFs are compiled from QuarkXPress by people with Macs (if that info matters). Every time a new PDF magazine is created, the web-development team copies and pastes (!) each article into a form on the internet (!), including title, content, keywords, references, author name, etc. It usually takes about 3 full days for one guy to finish the job.

When I was working there (I am not anymore, this was nearly seven years ago), I sped the process up threefold using a clipboard monitoring app and some simple XML-based PHP scripts that interact with the server. All you needed to do then was select text, CTRL+C, select some more text, CTRL+C, go to the app (ALT+TAB), press ‘next article’, and repeat. But we, or mostly I, still spent about fifty days per year processing PDF magazines.

Now I’m seven years down the line, and I am about to speak to my old boss again, for a friendly visit. I know they are still using my apps (!). But perhaps it would be a nice idea to look into their problem again and see if I can suggest a coding project that could help them?

I have never used QuarkXPress; I only know that it is something similar to MS Word, and that’s as far as my knowledge of the software goes. I am not very familiar with unencrypted, extracted PDF code/syntax.

In short: Does QuarkXPress have specific compilation patterns that scripts could use to extract articles from the PDFs? What ‘intelligent’ tools are there that can ‘learn’ from similarly structured PDF pages where the article contents are? Are there tools, such as QuarkXPress modules of some sort, that can ‘encapsulate’ or ‘mark’ an article with an invisible reference tag to make extraction a lot simpler for scripts?

The people creating these PDFs have been doing their job for the past 20 years and are unwilling to change their workflow, except for software updates. Any additional tool for them must not interfere with their workflow, or they will just refuse it.

I don’t want code, merely some descriptions of what you or other people have done with regard to other PDF extraction problems. The best answer would be a description of perhaps several methods, or some references to external links with case descriptions.

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 5:41 am

    Broad question, but at first sight my answer would be that if you let them go as far as the PDF, you’re making things very difficult already. If they are still using QuarkXPress, there are far better ways to do this kind of thing, and similar approaches are actually used by quite a few publishers out there.

    1) Look into generating both PDF and XML out of QuarkXPress. It’s fine that they don’t want to change their ways, but they have to create a PDF out of Quark anyway; also generating XML is not a really big additional step. In fact (warning: affiliation!) there are tools that can make all of this into one step. You could write AppleScript, for example, to steer the process, but something like axaio MadeToPrint will automatically generate both the (correct) PDF and an XML file when people click “export”.

    2) Once you have the PDF and the XML of the same content, use the PDF for print (just as now) and then write some code to convert the XML into whatever you need on the web site. If the coding is done on the web site itself, you might not even need to tweak the XML coming out of Quark; simply make the site smart enough to pick up whatever bits and pieces are necessary.

    Broad answer to a broad question; hope that was what you were looking for…
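
To give a rough idea of the "pick up whatever bits and pieces are necessary" part of step 2, here is a minimal sketch in Python. It is only an illustration under assumptions: the element names (`issue`, `article`, `title`, `author`, `keyword`, `body`, `para`) and the sample data are invented for this example, and the XML that actually comes out of Quark will look different depending on how the documents are tagged.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: the structure and tag names are assumptions,
# not the real format produced by QuarkXPress or any export tool.
SAMPLE_EXPORT = """\
<issue year="2010" number="7">
  <article>
    <title>First article</title>
    <author>A. Writer</author>
    <keyword>pdf</keyword>
    <keyword>xml</keyword>
    <body>
      <para>Opening paragraph.</para>
      <para>Second paragraph.</para>
    </body>
  </article>
  <article>
    <title>Second article</title>
    <author>B. Editor</author>
    <body>
      <para>Only paragraph.</para>
    </body>
  </article>
</issue>
"""


def text_of(element, tag):
    """Return the stripped text of the first child named `tag`, or ""."""
    child = element.find(tag)
    return (child.text or "").strip() if child is not None else ""


def extract_articles(xml_text):
    """Collect one dict per <article>, ready to hand to the site's form or database."""
    root = ET.fromstring(xml_text)
    articles = []
    for node in root.iter("article"):
        articles.append({
            "title": text_of(node, "title"),
            "author": text_of(node, "author"),
            "keywords": [(k.text or "").strip() for k in node.findall("keyword")],
            # Join paragraphs with blank lines; a real site might emit HTML instead.
            "content": "\n\n".join((p.text or "").strip() for p in node.iter("para")),
        })
    return articles


if __name__ == "__main__":
    for article in extract_articles(SAMPLE_EXPORT):
        print(article["title"], "by", article["author"])
```

The point of the sketch is that the site-side code only looks for the fields it needs and ignores everything else, so the XML coming out of Quark would not have to be cleaned up first.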
