Editorial Team
Asked: May 25, 2026

This will be a tricky question, but I will try anyway.
Our task is to feed Microsoft FAST ESP with gigabytes of data. The final amount of indexed data is somewhere in the neighborhood of 50-60 GB.

FAST has a .NET API but core components are written in Python (processing pipelines to index documents). The challenge is to reliably communicate with the system while feeding it gigabytes of data for indexing.

The problems that arise with FAST here are:

  1. the system is quirky when it is fed too much data at once as it
    wants to reindex its data during which the system remains unreachable
    for hours. Unacceptable.

  2. it is not an option to queue up all data and serially feed one item
    at a time since this will take too long (several days).

  3. when an item cannot be indexed by FAST the client has to re-feed the
    item. For this to work, the system is supposed to call a callback
    method to inform the client about the failure. However, whenever the
    system times out the feeding client is unable to react to the timeout
    because that callback is never called. Hence the client is starving.
    Data is in the queue but cannot be passed along to the system. The
    queue collapses. Data is lost. You get the idea.

Notes:

  1. feeding an item can take seconds for a small item and up to 5-8
    hours for a single large item.
  2. the items being indexed are both binary and text based.
  3. the goal is for the full indexing to take "only" 48-72h, i.e. it
    must happen over the weekend.
  4. The FAST document processing pipelines (Python code) here have
    around 30 stages each. There are a total of 27 pipelines as of this
    writing.

In summary:

The major challenge is to feed the system with items, big and small,
at just the right speed (not too fast because it might collapse or run
into memory issues; not too slow because this will take too long),
simultaneously, in a parallel manner like asynchronously running threads. In
my opinion there has to be an algorithm that decides when to feed
what items and how many at once. Parallel programming comes to mind.

There could also be multiple "queues" where each queue (process) is dedicated to
certain-sized items which are loaded in a queue and then fed one by one (in worker threads).
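
The multi-queue idea above could be sketched roughly as follows. This is only an illustration of the scheduling shape, not of any FAST ESP API: the bucket thresholds, the worker counts, and the `feed` callable are all assumptions that would need tuning against the real system.

```python
import queue
import threading

# Hypothetical size buckets (upper bound in bytes); thresholds are assumptions.
BUCKETS = [
    ("small", 1 * 1024**2),
    ("medium", 50 * 1024**2),
    ("large", float("inf")),
]


def bucket_for(size_bytes):
    """Return the name of the first bucket whose upper bound fits the item."""
    for name, limit in BUCKETS:
        if size_bytes <= limit:
            return name


def run(items, feed, workers_per_bucket=None):
    """Sort (item_id, size) pairs into per-size queues and drain them in threads.

    `feed` is a placeholder for whatever call actually submits one item.
    """
    workers_per_bucket = workers_per_bucket or {"small": 4, "medium": 2, "large": 1}
    queues = {name: queue.Queue() for name, _ in BUCKETS}
    for item_id, size in items:
        queues[bucket_for(size)].put((item_id, size))

    def worker(q):
        # Each worker feeds items one by one until its queue is empty.
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return
            feed(item)

    threads = [
        threading.Thread(target=worker, args=(queues[name],))
        for name, count in workers_per_bucket.items()
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Giving the large bucket a single worker keeps one huge item from competing with another, while the small bucket can run several feeds side by side.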

I am curious whether anyone has ever done anything like this, or how you would go about a problem like this.

EDIT: Again, I am not looking to "fix" FAST ESP or improve its inner
workings. The challenge is to effectively use it!

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 12:45 pm

    It sounds like you’re working with a set of issues more than a specific C# feeding speed issue.

    A few questions up front: is this 60 GB of data to be consumed every weekend, or is it an initial backfill of the system? Does the data exist as items on a filesystem local to the ESP install, or elsewhere? Is this a single internal ESP deployment or a solution you’re looking to replicate in multiple places? Is it a single-node install or multiple nodes (or rather, how many; the single-node cap is 20 docprocs)?

    ESP performance is usually limited by the number of documents to be handled more than the number of files. Assuming your data ranges between email-sized documents (about 35k each) and filesystem-sized documents (about 350k each), your 60 GB equates to between 180k and 1.8 million docs, so to feed that over 48 hours you need to feed between 3,750 and 37,500 documents per hour. That’s not a very high target on modern hardware (if you installed this on a VM… well, all bets are off; it’d be better off on a laptop).
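
The estimate above is plain arithmetic and can be re-run for other sizes or time windows. A minimal sketch, assuming decimal units and the two average document sizes quoted in the paragraph; `docs_per_hour` is a hypothetical helper name:

```python
def docs_per_hour(total_bytes, avg_doc_bytes, window_hours):
    """Documents per hour needed to feed total_bytes within window_hours."""
    doc_count = total_bytes / avg_doc_bytes
    return doc_count / window_hours


TOTAL_BYTES = 60 * 10**9  # 60 GB in decimal units (assumption)

for avg in (35_000, 350_000):  # the two average sizes assumed in the text
    docs = TOTAL_BYTES / avg
    rate = docs_per_hour(TOTAL_BYTES, avg, 48)
    print(f"avg {avg} bytes: ~{docs:,.0f} docs, ~{rate:,.0f} docs/hour over 48h")
```

The exact results come out slightly below the rounded figures in the text, which does not change the conclusion. Passing 72 instead of 48 gives the rate for the longer end of the weekend window.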

    For feeding, you have a choice between more control, by managing the batches you feed yourself, and faster coding, by using the DocumentFeeder framework in the API, which abstracts a lot of the batch logic. If you’re just going for 37.5k docs/hr, I’d save the overhead and just use DocumentFeeder, though take care with its config params. DocumentFeeder lets you treat your content on a per-document basis instead of creating the batches yourself, and it also allows some measure of automatic retrying based on config. The general target should be a max of 50 MB of content or 100 docs per batch, whichever comes first. Larger docs should be sent in smaller batches, so a 50 MB file should ideally be sent by itself. With DocumentFeeder you actually lose control of the batches it forms, so the logic there is a best effort on the part of your code.
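
If you do manage the batches yourself, the 50 MB / 100 docs rule could be sketched like this. It is a minimal illustration that only groups `(doc_id, size)` pairs; it does not call the ESP API, and `make_batches` is a hypothetical name.

```python
MAX_BATCH_BYTES = 50 * 1024 * 1024  # 50 MB of content per batch
MAX_BATCH_DOCS = 100                # or 100 docs, whichever comes first


def make_batches(docs, max_bytes=MAX_BATCH_BYTES, max_docs=MAX_BATCH_DOCS):
    """Group (doc_id, size_bytes) pairs into batches within both limits.

    A document at or above max_bytes ends up in a batch by itself.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        size = doc[1]
        # Close the current batch if this document would break either limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

Because a batch is closed before an oversized document is added and again right after it, large files naturally travel alone, which matches the advice above.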

    Use the callbacks to monitor how well the content is making it into the system. Set limits on how many documents have been fed for which you haven’t yet received the final callbacks. The target should be X batches submitted at any given time, or Y MB; pause at either cutoff. X should be about 20 + the number of document processors, and Y should be in the area of 500-1000 MB. With DocumentFeeder it’s just a pass/fail per doc; with the traditional system it’s more detailed. Only wait for the ‘secured’ callback, which tells you the document has been processed and will be indexed; waiting for it to be searchable is pointless.
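
The X-batches-or-Y-MB cutoff can be sketched as a small limiter that the feeding thread blocks on and the callback handler releases. This is an assumption-laden illustration, not the ESP API: `InFlightLimiter` and its methods are hypothetical names, and `expire_stale` is an added safeguard for the case where a callback never arrives.

```python
import threading
import time


class InFlightLimiter:
    """Caps the batches and bytes submitted but not yet confirmed."""

    def __init__(self, doc_processors, max_mb=500):
        self.max_batches = 20 + doc_processors   # X from the text
        self.max_bytes = max_mb * 1024 * 1024    # Y from the text
        self._pending = {}                       # batch_id -> (size, submit time)
        self._bytes = 0
        self._cond = threading.Condition()

    def _has_room(self, size):
        if not self._pending:
            return True  # always admit one batch so an oversized one cannot deadlock
        return (len(self._pending) < self.max_batches
                and self._bytes + size <= self.max_bytes)

    def acquire(self, batch_id, size, timeout=None):
        """Block until there is room, then record the batch. False on timeout."""
        with self._cond:
            ok = self._cond.wait_for(lambda: self._has_room(size), timeout)
            if ok:
                self._pending[batch_id] = (size, time.monotonic())
                self._bytes += size
            return ok

    def release(self, batch_id):
        """Call from the final ('secured' or failed) callback for a batch."""
        with self._cond:
            entry = self._pending.pop(batch_id, None)
            if entry is not None:
                self._bytes -= entry[0]
                self._cond.notify_all()
            return entry is not None

    def expire_stale(self, max_age_seconds):
        """Drop batches whose callback never came; returns their ids for refeeding."""
        now = time.monotonic()
        with self._cond:
            stale = [b for b, (_, t) in self._pending.items()
                     if now - t > max_age_seconds]
            for b in stale:
                size, _ = self._pending.pop(b)
                self._bytes -= size
            if stale:
                self._cond.notify_all()
            return stale
```

The feeder would call `acquire` before submitting each batch and `release` from the callback. A periodic `expire_stale` call keeps the feeder from starving when the system times out silently.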

    Set some limits on your content. In general ESP will break down with very large files. There’s a hard limit at 2 GB since the processes are still 32-bit, but in reality anything over 50 MB should only have its metadata fed in. Also, avoid feeding log data: it’ll crush the internal structures, killing performance if not erroring out. Things can be done in the pipeline to modify what’s searchable to ease the pain of some log data.
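
Those content limits can be applied as a simple pre-feed check on the client side. A minimal sketch: the mode names and the `is_log` flag are assumptions for illustration, and how log data is actually detected is left to the caller.

```python
METADATA_ONLY_BYTES = 50 * 1024**2  # above this, feed only the metadata


def feed_mode(size_bytes, is_log=False):
    """Decide how a single item should be fed, based on the limits above."""
    if is_log:
        return "skip"            # log data is best kept out of the index
    if size_bytes > METADATA_ONLY_BYTES:
        return "metadata_only"   # also keeps content far below the 2 GB hard limit
    return "full"
```

Running this before batching means oversized content never reaches the pipeline in the first place.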

    You also need to make sure your index is configured well: at least 6 partitions, with a focus on keeping the lower-order ones fairly empty. It’s hard to go into the details of that without knowing more about the deployment. The pipeline config can have a big impact as well; no document should ever take 5-8 hours. Make sure to replace any searchexport or htmlexport stages being used with custom instances that have a sane timeout (30-60 sec); the default is no timeout.

    Last point: odds are that no matter how your feeding is configured, the pipeline will error out on some documents. You’ll need to be prepared to either accept that or refeed just the metadata (there are other options, but they’re kind of outside the scope here).

    Good luck.
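
One way to handle documents the pipeline rejects is to retry a bounded number of times and then fall back to metadata only. A minimal sketch: `feed_full` and `feed_metadata` are hypothetical callables standing in for your own feeding code, each returning True on success.

```python
def feed_with_fallback(doc_id, feed_full, feed_metadata, max_attempts=3):
    """Try the full document a few times, then fall back to metadata only.

    Returns "full", "metadata_only", or "failed".
    """
    for _ in range(max_attempts):
        if feed_full(doc_id):
            return "full"
    if feed_metadata(doc_id):
        return "metadata_only"
    return "failed"
```

Logging the documents that end up as "metadata_only" or "failed" gives you a list to review after the weekend run instead of silently losing them.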
