The Archive Base

Editorial Team
Asked: May 18, 2026 (2026-05-18T05:41:33+00:00)

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?

I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.

Here is my proposed architecture:

  • Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLINQ) to prepare those documents for indexing
  • Index those documents into a Lucene.NET / Lucene (Java) compatible file format
  • Deploy that file to all my Solr instances
  • Activate that replicated index
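To make the first two steps concrete, here is a minimal in-process sketch of the map and reduce phases that build an inverted index. It is a toy simulation in plain Python, not Hadoop or Lucene code; the function names (`map_phase`, `reduce_phase`, `build_index`) and the sample documents are hypothetical, and a real job would write a Lucene-compatible index instead of a dictionary.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs for one document (the "map" step)."""
    for token in text.lower().split():
        term = token.strip(".,;:!?()\"'")
        if term:
            yield term, doc_id


def reduce_phase(pairs):
    """Group the emitted pairs by term into sorted posting lists (the "reduce" step)."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def build_index(docs):
    """Run both phases sequentially; a real framework would distribute them."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)


if __name__ == "__main__":
    # Hypothetical sample input standing in for parsed logfiles.
    docs = {"log1": "Disk error on node A", "log2": "Network error on node B"}
    index = build_index(docs)
    print(index["error"])
```

The point of the split is that `map_phase` only ever looks at one document, so it can run on many machines in parallel, while `reduce_phase` only needs all pairs for a given term.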

If the above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.

Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow “activate” them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
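One possible way to "activate" an externally built index is Solr's CoreAdmin `MERGEINDEXES` action, which merges one or more index directories into an existing core. The sketch below only builds the request URL and sends nothing; the base URL, core name, and directory path are assumptions for illustration, and the helper `merge_indexes_url` is hypothetical.

```python
from urllib.parse import urlencode


def merge_indexes_url(solr_base, core, index_dirs):
    """Build a CoreAdmin MERGEINDEXES URL for the given target core.

    solr_base, core, and index_dirs are placeholders; adjust to your install.
    """
    params = [("action", "MERGEINDEXES"), ("core", core)]
    # indexDir may be repeated, once per directory to merge in.
    params += [("indexDir", d) for d in index_dirs]
    return f"{solr_base}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    print(merge_indexes_url("http://localhost:8983/solr", "core0", ["/data/idx1"]))
```

As far as I know, the merged documents become searchable only after a commit on the target core, and the external index has to be built against the same schema the core uses.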

The reason I bring .NET into the picture is that we are mostly a .NET shop. The only Unix / Java component we will have is Solr, with a front end that uses its REST interface via SolrNet.

Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?

What should I avoid doing so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it’s saying.
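For context, faceting in Solr is requested at query time with the `facet` and `facet.field` parameters, so it depends on the fields being present in the index that Solr serves rather than on the tool that built it. A minimal sketch that only builds such a query URL follows; the core name `docs`, the field name `doc_type`, and the helper `facet_query_url` are hypothetical.

```python
from urllib.parse import urlencode


def facet_query_url(solr_base, core, query, facet_fields):
    """Build a select URL that asks Solr for facet counts on the given fields."""
    # rows=0 returns only the counts, not the matching documents.
    params = [("q", query), ("rows", "0"), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    return f"{solr_base}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(facet_query_url("http://localhost:8983/solr", "docs", "error", ["doc_type"]))
```

If the index is built outside Solr, the fields you want to facet on would presumably need to be indexed in a form that matches the Solr schema.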

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 5:41 am

    Generally, what you’ve described is almost exactly how Nutch works. Nutch is a crawling, indexing, index-merging and query-answering toolkit that’s based on Hadoop core.

    You shouldn’t confuse Cloudera, Hadoop, Nutch and Lucene with one another. You’ll most likely end up using all of them:

    • Nutch is the name of the indexing / query-answering machinery (like Solr).
    • Nutch itself runs on a Hadoop cluster (which makes heavy use of its own distributed file system, HDFS).
    • Nutch uses the Lucene index format.
    • Nutch includes a query-answering frontend, which you can use, or you can attach a Solr frontend and use the Lucene indexes from there.
    • Finally, Cloudera Hadoop Distribution (or CDH) is just a Hadoop distribution with several dozen patches applied to it, to make it more stable and to backport some useful features from development branches. Yeah, you’d most likely want to use it, unless you have a reason not to (for example, if you want a bleeding-edge Hadoop 0.22 trunk).

    Generally, if you’re just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index all sorts of document types, including MS Word documents, PDFs, etc.

    I personally don’t see much point in using .NET technologies here, but if you feel comfortable with them, you can build the front ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I were managing such a project, I’d consider alternatives, especially if your crawling and indexing task is limited (i.e. you don’t want to crawl the whole internet for some purpose).
