The Archive Base

Editorial Team
Asked: June 7, 2026


I have a database of raw text that needs to be analysed. For example, I have collected the title tags of hundreds of millions of individual webpages and clustered them by topic. I am now interested in performing some additional tests on subsets of each topic cluster. The problem is twofold. First, I cannot fit all of the text into memory to evaluate it. Second, I need to run several of these analyses in parallel, so even if I could fit one subset into memory, I certainly could not fit many subsets into memory.

I have been working with generators, but it is often necessary to know information about rows of data that have already been loaded and evaluated.
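One way to reconcile generators with the need to remember earlier rows is to keep only a compact running summary rather than the rows themselves. The sketch below is an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for the real database, and the `pages` table, its `cluster_id` and `title` columns, and the function names are all hypothetical.

```python
import sqlite3
from collections import Counter


def iter_titles(conn, cluster_id, batch_size=1000):
    """Yield titles for one cluster, fetching batch_size rows at a time."""
    cur = conn.cursor()
    # Hypothetical schema: pages(cluster_id, title).
    cur.execute("SELECT title FROM pages WHERE cluster_id = ?", (cluster_id,))
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for (title,) in rows:
            yield title


def summarize(titles):
    """Consume a stream of titles, retaining only aggregate state.

    Memory grows with the number of distinct words, not the number of rows.
    """
    n_rows = 0
    counts = Counter()
    for title in titles:
        n_rows += 1
        counts.update(title.lower().split())
    return n_rows, counts


def demo():
    """Build a tiny in-memory table and summarize one cluster."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (cluster_id INTEGER, title TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [
            (1, "Cheap flights to Paris"),
            (1, "Paris hotels and flights"),
            (2, "Python generators tutorial"),
        ],
    )
    return summarize(iter_titles(conn, 1, batch_size=2))
```

With MySQL, the default client cursors typically buffer the whole result set on the client, so a server-side (unbuffered) cursor such as `SSCursor` in MySQLdb or PyMySQL would probably be needed for the batching to actually save memory.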

My question is this: what are the best methods for handling and analysing data that cannot fit into memory? The data must be extracted from some sort of database (currently MySQL, but I will likely be switching to a more powerful solution soon).

I am building the software that handles the data in Python.

Thank you,

EDIT

I will be researching and brainstorming on this all day and plan on continuing to post my thoughts and findings. Please leave any input or advice you might have.

IDEA 1: Tokenize words and n-grams and save to file.
For each string pulled from the database, tokenize it using the tokens in an already existing file. If a token does not exist, create it. For each string, combine the word tokens from right to left until a single representation of all the words in the string exists. Search an existing list of reduced tokens (one that can fit in memory) to find potential matches and similarities. Each reduced token will contain an identifier that indicates its token categories. If a reduced token (one that was created by combining word tokens) matches a tokenized string of interest categorically, but not directly, then the reduced token will be broken down into its component word tokens and compared word token by word token to the string of interest.

I have no idea if there already exists a library or module that can do this, nor am I sure how much benefit I will gain from it. However, my priorities are: 1) conserve memory, 2) worry about runtime. Thoughts?
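The first part of IDEA 1 (look up each word in a token file, creating the token if it is missing) could be sketched roughly as follows. This is only an illustration, not a full implementation of the idea: the `Vocabulary` class, the JSON file format, and whitespace tokenization are all assumptions, and the right-to-left combination and category matching steps are not covered.

```python
import json
import os
import tempfile


class Vocabulary:
    """Map word tokens to small integer IDs, persisted in a JSON file."""

    def __init__(self, path):
        self.path = path
        self.token_to_id = {}
        # Reuse tokens from an already existing file, if there is one.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.token_to_id = json.load(f)

    def encode(self, text):
        """Return the tuple of token IDs for text, creating missing tokens."""
        ids = []
        for word in text.lower().split():
            if word not in self.token_to_id:
                self.token_to_id[word] = len(self.token_to_id)
            ids.append(self.token_to_id[word])
        return tuple(ids)

    def save(self):
        """Write the current token table back to the file."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.token_to_id, f)


def demo():
    """Encode two strings, save, reload, and encode again."""
    path = os.path.join(tempfile.mkdtemp(), "vocab.json")
    vocab = Vocabulary(path)
    a = vocab.encode("cheap flights to Paris")
    b = vocab.encode("Paris flights")
    vocab.save()
    reloaded = Vocabulary(path)
    c = reloaded.encode("Paris flights")
    return a, b, c
```

Storing strings as tuples of integers is usually more compact than storing the text, but the token table itself still has to fit in memory, so the saving depends on how much vocabulary the titles share.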

EDIT 2

Hadoop is definitely going to be the solution to this problem. I found some great resources on natural language processing in Python and Hadoop. See below:

  1. http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python
  2. http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
  3. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
  4. https://github.com/klbostee/dumbo/wiki/Short-tutorial
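For a sense of what Python on Hadoop looks like, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated `key<TAB>value` lines and the reducer receives its input sorted by key. The function names and the `local_run` helper are hypothetical; `local_run` only simulates the sort-and-shuffle step on one machine so the logic can be checked without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word; input must be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def local_run(lines):
    """Simulate map -> sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, each function would be wrapped in its own script that reads `sys.stdin` and prints its results, and Hadoop would perform the sort between the two stages.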

Thanks for your help!


1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 11:42 pm

    Map/Reduce was created for this purpose.

    The best map reduce engine is Hadoop, but it has a steep learning curve and needs many nodes to be worth it. If this is a small project, you could use MongoDB, which is a really easy-to-use database and includes an internal map reduce engine that uses JavaScript.
    MongoDB’s map reduce framework is really simple and easy to learn, but it lacks all the tools that you could get in the JDK using Hadoop.

    WARNING: You can only run one map reduce job at a time on MongoDB’s map reduce engine. This is all right for chaining jobs or for medium datasets (<100GB), but it lacks Hadoop’s parallelism.
