The Archive Base


Editorial Team
Asked: June 13, 2026


I am writing a web script for a small online document management company that wants to allow users to quickly search the content of their files online.
While many accounts are very small (under 100 files of 2 MB each), a handful have 1,000,000 files or more. Support for PDF and DOC/DOCX is needed. Binary files won’t be indexed.

We’re looking for a simple solution that provides basic search results. Nothing too fancy.
Each user has a home folder, and a search covers only that folder and its subfolders, so the search system should be optimized for that. To illustrate: if a user with a 100 MB account searches his home folder, it makes sense not to search the other 4 TB of files.

What do you suggest?

Here are some options I was looking at:

1) I was thinking of using Windows Search for this, either through a command-line tool or an API. But each server can have literally 1 billion files, and the top 3 results should be delivered instantly. Will Windows Search do? Or will this yield frustration?

2) Custom: Making a simple open-source MySQL database program to hold index information.
There are about 100,000 words in the English language, plus custom words and acronyms. So for a fast lookup, it makes sense to index by word and user account.
I will pre-process so that “jogging” becomes “jog” and “fiddling” becomes “fiddle”, to lower the DB size.
Given 150 customer accounts per server, would it make sense to have one big DB, or maybe eliminate the UserID field and give each user a DB?
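
The pre-processing step described above could be sketched roughly like this. It is only a minimal illustration, assuming a known vocabulary of base forms is available; `normalize` and `SAMPLE_VOCABULARY` are hypothetical names, and a real stemmer handles far more suffixes than "-ing".

```python
# Hypothetical vocabulary of base forms; a real system would load a full word list.
SAMPLE_VOCABULARY = {"jog", "fiddle", "dance"}


def normalize(word, vocabulary):
    """Return (base_form, is_word_form) for a single word.

    Only the "-ing" suffix is handled, and a candidate base form is accepted
    only if it appears in the supplied vocabulary.
    """
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        candidates = [stem, stem + "e"]
        # "jogging" -> "jogg": try dropping the doubled consonant first.
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.insert(0, stem[:-1])
        for candidate in candidates:
            if candidate in vocabulary:
                return candidate, True
    # Unknown or unchanged words are indexed as-is.
    return word, False
```

The second element of the returned pair is the kind of information the IsWordForm flag is meant to record.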

Tables:
Table WordTable
EnglishWord (pk) | WordID (fk)

Table FileTable
FileID (pk) | FilePath

Table WordIndex
WordID (pk) | FileID (fk) | UserID | SettingsPatternID

Table Settings
SettingsPatternID | Top (bool) | IsWordForm (bool)

IsWordForm = Indicates the entry is not an exact match but a form of the word. Ex: The word in the file was originally “jogging” or “dancing”, but it is filed under the short form “jog” or “dance”. (If the query was also a word form, that helps with relevancy.) The likelihood of IsWordForm being set is high.
Top = Word is at top 50 words of document (indicates title)
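
For illustration, the tables above might be prototyped as follows. This is a sketch under assumptions, not a recommended design: SQLite (via Python's `sqlite3`) stands in for MySQL so the example is self-contained, the Settings table is omitted, the composite key order (UserID, WordID, FileID) is my own choice so that lookups are scoped to one account, and `build_index_db` and `search` are hypothetical helper names.

```python
import sqlite3


def build_index_db():
    """Create an in-memory database mirroring the proposed tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE WordTable (
            WordID INTEGER PRIMARY KEY,
            EnglishWord TEXT UNIQUE NOT NULL
        );
        CREATE TABLE FileTable (
            FileID INTEGER PRIMARY KEY,
            FilePath TEXT NOT NULL
        );
        -- Leading with UserID keeps each account's rows together in the key.
        CREATE TABLE WordIndex (
            UserID INTEGER NOT NULL,
            WordID INTEGER NOT NULL REFERENCES WordTable(WordID),
            FileID INTEGER NOT NULL REFERENCES FileTable(FileID),
            SettingsPatternID INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (UserID, WordID, FileID)
        ) WITHOUT ROWID;
        """
    )
    return conn


def search(conn, user_id, word, limit=3):
    """Return up to `limit` file paths for one user's files containing `word`."""
    rows = conn.execute(
        """
        SELECT f.FilePath
        FROM WordIndex wi
        JOIN WordTable w ON w.WordID = wi.WordID
        JOIN FileTable f ON f.FileID = wi.FileID
        WHERE wi.UserID = ? AND w.EnglishWord = ?
        ORDER BY f.FilePath
        LIMIT ?
        """,
        (user_id, word, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

With one shared database, the UserID prefix in the key is what prevents a small account's query from touching other accounts' rows.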

I’d like a small storage overhead of 5-15%. CPU is very precious…
But that’s a lot of overhead per file, since each file will generate thousands of records in WordIndex, i.e.:

WordID, FileID, UserID, SettingsPatternID
WordID, FileID, UserID, SettingsPatternID
WordID, FileID, UserID, SettingsPatternID

…
This is the longest table, and WordID is needlessly repeated.
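
A back-of-the-envelope check of that overhead might look like the sketch below. The figures in the usage note (3,000 distinct words per file, 16 bytes per row) are assumptions picked for illustration, not measurements, and `index_overhead_ratio` is a hypothetical helper.

```python
def index_overhead_ratio(file_size_bytes, unique_words, bytes_per_row):
    """Estimate raw WordIndex size for one file as a fraction of the file size.

    Assumes one row per distinct word in the file. Key and B-tree overhead
    in a real database is ignored, so actual storage would be higher.
    """
    return (unique_words * bytes_per_row) / file_size_bytes
```

For example, `index_overhead_ratio(2 * 1024 * 1024, 3000, 16)` gives roughly 0.023, or about 2.3% of a 2 MB file, before any database-level overhead is counted.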

3) Hashing, with MySQL
Since we know it will be a search of words, a pure relational database might not be the best model…

It may be more efficient to “hash” each word to a list of matching files.
Ex: Make one small table per word. You don’t need to “look up” the word in a table, since we already know what it is.
Each word’s table would hold just the list of matching files:

Table *The Word*
FileID | UserID | SettingsPatternID
(There would be 100,000 of these. One for each unique word.)

Table Settings
SettingsPatternID | Top (bool) | IsWordForm (bool)
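
The "hash each word to a list of files" idea is essentially an inverted index. A minimal in-memory sketch follows, assuming postings are keyed by (user, word) so a search never leaves the user's own files; the class and method names are hypothetical, and persistence is left out entirely.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps (user_id, word) to the files containing that word."""

    def __init__(self):
        # (user_id, word) -> {file_id: True if the word is among the top words}
        self._postings = defaultdict(dict)

    def add(self, user_id, file_id, words, top_n=50):
        """Index the words of one file; the first `top_n` words count as 'Top'."""
        for position, word in enumerate(words):
            key = (user_id, word.lower())
            is_top = position < top_n
            already_top = self._postings[key].get(file_id, False)
            self._postings[key][file_id] = already_top or is_top

    def search(self, user_id, word, limit=3):
        """Return up to `limit` file IDs, files with a 'Top' hit first."""
        hits = self._postings.get((user_id, word.lower()), {})
        ranked = sorted(hits.items(), key=lambda item: (not item[1], item[0]))
        return [file_id for file_id, _ in ranked[:limit]]
```

A lookup here is a single dictionary access per query word, which is the property the per-word tables are trying to get from MySQL.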

4) I’ve also looked at Solr, but I think it’s overkill. Is that a bad assumption? While it supports PDF and DOC, it’s also a fair bit of work to integrate… I almost feel it would be the same amount of work to do it myself, but of course as a coder I know that assumption is wrong too often…

Thoughts please!!!


1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 3:34 am

    4) I’ve also looked at Solr but I think it’s overkill. Is that a bad
    assumption? While it supports PDF and DOC, it’s also a fair bit of
    work to integrate… I almost feel it will be the same amount of work
    to do it myself, but of course as a coder I know that assumption’s
    wrong too often…

    Definitely go with Solr: it is more costly to integrate, but it will be easier to set up and much easier to maintain.

    Moreover, it already has many of the features you’d have to otherwise implement (and debug, and maintain…) by yourself.

    I’d suggest, however, that you review Solr’s features, design a basic interface around those features, and have it approved in writing. “Text searching” too often becomes an unspoken “I want the system to be able to read my mind”. Also, explain that efficient text searching is not a “simple script”; there are literally thousands of Ph.D. papers on semantics, stemming, relevance, proximity and so on. Many of those papers have found their way into Solr/Lucene.

    Solr is “overkill” only if you assume that users will be satisfied with grep in terms of performance, scalability and results. Trust me, they won’t.
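
To give an idea of what a basic interface could look like, here is a hedged sketch that only builds a request URL for Solr's standard `select` handler, using the `q`, `fq`, `rows` and `wt` parameters. The host, the core name `documents`, and the field names `content` and `owner_id` are assumptions for illustration; they depend entirely on how the schema is defined.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, user_id, terms, rows=3):
    """Build a select URL that searches `terms` within one user's documents.

    The `fq` filter query restricts results to a single owner, so a small
    account's search is not scored against other accounts' files.
    NOTE: `terms` is passed through unescaped here; real code should escape
    Solr query syntax characters in user input.
    """
    params = {
        "q": f"content:({terms})",    # hypothetical full-text field
        "fq": f"owner_id:{user_id}",  # hypothetical per-user field
        "rows": rows,                 # only the top few results
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"
```

For example, `build_solr_query_url("http://localhost:8983", "documents", 42, "budget report")` produces a URL that can be fetched with any HTTP client.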

    You may try suggesting a Google Machine. It will also help establish a baseline for costs: i.e., “if you want Google performance, this is the price of Google. Any other ad hoc implementation without Google’s economies of scale would cost far more to achieve the same level of performance”.

© 2021 The Archive Base. All Rights Reserved