The Archive Base

Editorial Team
Asked: June 4, 2026

Revisiting a stalled project and looking for advice in modernizing thousands of old documents

Revisiting a stalled project and looking for advice in modernizing thousands of “old” documents and making them available via web.

Documents exist in various formats, some obsolete: (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a ‘modern’ format, and many of the hardcopies have already been OCR’d into PDFs – we had originally assumed that PDF would be the final format but we’re open to suggestions (XML?).

Once all docs are in a common format we would like to make their contents available and searchable via a web interface. We’d like the flexibility to return only portions (pages?) of the entire document where a search ‘hit’ is found (I believe Lucene/Elasticsearch makes this possible?). Might it be more flexible if the content were all XML? If so, how and where should the XML be stored: directly in a database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

Curious how others might approach this. There is no “wrong” answer; I’m just looking for as many inputs as possible to help us proceed.

Thanks for any advice.

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 8:07 pm

    In summary: I recommend ElasticSearch, but let’s break the problem down and talk about how to implement it.

    There are a few parts to this:

    1. Extracting the text from your docs to make them indexable
    2. Making this text available as full text search
    3. Returning highlighted snippets of the doc
    4. Knowing where in the doc those snippets are found, to allow for paging
    5. Returning the full doc

    What can ElasticSearch provide:

    1. ElasticSearch (like Solr) can use Tika, via its attachment plugin, to extract text and metadata from a wide variety of doc formats
    2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language with stemming, boosting the relevance of certain fields (eg title more important than content), ngrams etc, ie standard Lucene stuff
    3. It can return highlighted snippets for each search result
    4. It DOESN’T know where those snippets occur in your doc
    5. It can store the original doc as an attachment, or it can store and return the extracted text. But it’ll return the whole doc, not a page.

    You could just send the whole doc to ElasticSearch as an attachment, and you’d get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.

    Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped in a way that a doc would be returned in the search results, even if search keywords appear on different pages.

    First the indexing part: storing your docs in ElasticSearch:

    1. Use Tika (or whatever you’re comfortable with) to extract the text from each doc. Leave it as plain text, or as HTML to preserve some formatting. (forget about XML, no need for it).
    2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
    3. Store the original doc in your filesystem, and record the path so that you can serve it later
    4. In ElasticSearch, index a “doc” doc which contains all of the metadata, and possibly the list of chapters
    5. Index each page as a “page” doc, which contains:

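      • A parent field which contains the ID of the “doc” doc (see “Parent-child relationship” below); the example queries below also assume this ID is stored in a doc_id field
      • The text (the text field in the queries below)
      • The page number (the page field in the queries below)
      • Maybe the chapter title or number
      • Any metadata which you want to be searchable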
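
The indexing steps above can be sketched in application code. This is only a minimal Python illustration of the document layout, not a definitive implementation: the function name build_index_docs and the "id", "parent" and "body" keys are hypothetical, the doc_id, page and text field names are taken from the queries in this answer, and the page texts are assumed to have been extracted already (for example with Tika). Sending the resulting docs to ElasticSearch is left out.

```python
def build_index_docs(doc_id, metadata, pages):
    """Return (doc_body, page_docs) for one source document.

    pages is a list of page texts in reading order; page numbers start at 1.
    """
    # The "doc" doc holds only the metadata (title, authors, dates, ...).
    doc_body = dict(metadata)

    page_docs = []
    for page_num, text in enumerate(pages, start=1):
        page_docs.append({
            "id": f"{doc_id}_{page_num}",  # e.g. "123_2"
            "parent": doc_id,              # ID of the parent "doc" doc
            "body": {
                "doc_id": doc_id,          # field used by the filter queries
                "page": page_num,
                "text": text,
                # Copy of metadata that should be searchable per page.
                "title": metadata.get("title"),
            },
        })
    return doc_body, page_docs


if __name__ == "__main__":
    doc, page_list = build_index_docs(
        123, {"title": "Annual Report"}, ["first page", "second page"]
    )
    for page_doc in page_list:
        print(page_doc["id"], page_doc["body"]["page"])
```

Keeping the page number inside both the ID and the body means a page can be fetched directly by ID or found with a filter on doc_id and page.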

    Now for searching. How you do this depends on how you want to present your results – by page, or grouped by doc.

    Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

    curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
    {
       "query" : {
          "text" : {
             "text" : "interesting keywords"
          }
       },
       "highlight" : {
          "fields" : {
             "text" : {}
          }
       }
    }
    '
    

    Displaying results grouped by “doc” with highlights from the text is a bit trickier. It can’t be done with a single query, but a little client side grouping will get you there. One approach might be:

    Step 1: Do a top-children-query to find the parent (“doc”) whose children (“page”) best match the query:

    curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1'  -d '
    {
       "query" : {
          "top_children" : {
             "query" : {
                "text" : {
                   "text" : "interesting keywords"
                }
             },
             "score" : "sum",
             "type" : "page",
             "factor" : "5"
          }
       }
    }
    '
    

    Step 2: Collect the “doc” IDs from the above query and issue a new query to get the snippets from the matching “page” docs:

    curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
    {
       "query" : {
          "filtered" : {
             "query" : {
                "text" : {
                   "text" : "interesting keywords"
                }
             },
             "filter" : {
                "terms" : {
                   "doc_id" : [ 1, 2, 3 ]
                }
             }
          }
       },
       "highlight" : {
          "fields" : {
             "text" : {}
          }
       }
    }
    '
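
Steps 1 and 2 can be glued together in application code. The following is a minimal Python sketch, assuming the step 1 response has the usual hits.hits[]._id shape and that each “page” doc carries a doc_id field; the function name build_snippet_query is hypothetical, and the query body simply mirrors the curl example above.

```python
def build_snippet_query(step1_response, keywords):
    """Build the step 2 request body from the step 1 search response."""
    # Collect the IDs of the best matching "doc" docs, in ranking order.
    doc_ids = [hit["_id"] for hit in step1_response["hits"]["hits"]]

    return {
        "query": {
            "filtered": {
                # Same full text query as in step 1.
                "query": {"text": {"text": keywords}},
                # Restrict the pages to those belonging to the matched docs.
                "filter": {"terms": {"doc_id": doc_ids}},
            }
        },
        "highlight": {"fields": {"text": {}}},
    }


if __name__ == "__main__":
    import json

    fake_step1 = {"hits": {"hits": [{"_id": "1"}, {"_id": "2"}, {"_id": "3"}]}}
    print(json.dumps(build_snippet_query(fake_step1, "interesting keywords"), indent=2))
```

Serialising the returned dict with json.dumps gives a body that can be sent in place of the hand-written JSON.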
    

    Step 3: In your app, group the results from the above query by doc and display them.
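
The client side grouping could look something like this. It is a hedged Python sketch, assuming each hit exposes _score, _source.doc_id, _source.page and an optional highlight.text list; the function name group_hits_by_doc and the doc_order parameter are hypothetical. IDs are compared as strings because the _id values returned by the first query are strings.

```python
def group_hits_by_doc(step2_response, doc_order=None):
    """Group matching pages by their parent doc.

    Returns a dict mapping doc ID (as a string) to a list of page entries,
    each with the page number, score and highlighted snippets.
    """
    grouped = {}
    for hit in step2_response["hits"]["hits"]:
        source = hit["_source"]
        key = str(source["doc_id"])
        grouped.setdefault(key, []).append({
            "page": source["page"],
            "score": hit["_score"],
            # A hit may have no highlight section; fall back to no snippets.
            "snippets": hit.get("highlight", {}).get("text", []),
        })

    if doc_order is not None:
        # Re-order the groups to follow the ranking of the "doc" query.
        ordered = {}
        for doc_id in doc_order:
            key = str(doc_id)
            if key in grouped:
                ordered[key] = grouped[key]
        grouped = ordered
    return grouped
```

Passing the doc IDs from the first query as doc_order keeps the docs in the order ElasticSearch ranked them, while the pages inside each group stay in the order the second query returned them.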

    With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

    curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
    {
       "query" : {
          "constant_score" : {
             "filter" : {
                "and" : [
                   {
                      "term" : {
                         "doc_id" : 1
                      }
                   },
                   {
                      "term" : {
                         "page" : 2
                      }
                   }
                ]
             }
          }
       },
       "size" : 1
    }
    '
    

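    Alternatively, give each “page” doc an ID of the form ${doc_id}_${page_num} (eg 123_2); then you can retrieve that page directly. Because child docs are routed by their parent’s ID, pass the parent ID along with the request:

    curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2?parent=123'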
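
With IDs of this form, previous/next links can be computed without another search. A small Python sketch, assuming the total page count is known (for example stored in the “doc” doc's metadata, which is an assumption rather than something the index provides); the helper names page_id and neighbour_page_ids are hypothetical.

```python
def page_id(doc_id, page_num):
    """Build a page ID such as "123_2" from a doc ID and a page number."""
    return f"{doc_id}_{page_num}"


def neighbour_page_ids(current_id, total_pages):
    """Return (previous_id, next_id) for a page ID; None at either end."""
    # Split on the last underscore so doc IDs containing "_" still work.
    doc_id, _, page_str = current_id.rpartition("_")
    page_num = int(page_str)

    prev_id = page_id(doc_id, page_num - 1) if page_num > 1 else None
    next_id = page_id(doc_id, page_num + 1) if page_num < total_pages else None
    return prev_id, next_id


if __name__ == "__main__":
    print(neighbour_page_ids("123_2", total_pages=3))
```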
    

    Parent-child relationship:

    Normally, in ES (and most NoSQL solutions) each doc/object is independent – there are no real relationships. By establishing a parent-child relationship between the “doc” and the “page”, ElasticSearch makes sure that the child docs (ie the “page”) are stored on the same shard as the parent doc (the “doc”).

    This enables you to run the top-children-query which will find the best matching “doc” based on the content of the “pages”.
