Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 646843
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 13, 20262026-05-13T21:39:25+00:00 2026-05-13T21:39:25+00:00

I’d like to begin thinking about how I can scale up my algorithms that

  • 0

I’d like to begin thinking about how I can scale up my algorithms that I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what are the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracota, and Eucalyptus) to make this happen, and how specifically these concepts and tools are related to each other. I have a rudimentary background in R, Python, and bash scripting and also C and Fortran programming, though I’m familiar with some basic functional programming concepts also. Do I need to change the way that I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (HRIPE)… or write wrappers for Python to enable multi-threading or Hadoop access? I understand this would might involve requirements for additional hardware and I would like some basic idea of what the requirements/options available might be. My apologies for this rather large and yet vague question, but just trying to get started – thanks in advance!

  • 1 1 Answer
  • 3 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-13T21:39:26+00:00Added an answer on May 13, 2026 at 9:39 pm

    While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited — you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).

    Relational databases may be an exception to “technologies pale” — they can really clamp you down when you’re trying to scale up a few orders of magnitude. Is that your situation — are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you — in the latter, you’re at the breaking point, or well past it, and must start thinking in terms of key/value stores.

    Back to algorithms — “data analysis” cover a wide range… most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can’t possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most “data mining” applications fall into the second category — you don’t need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.

    This is particularly crucial if you ever need to do “real-near-time” data analysis — near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% (I don’t know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.

    The work I’ve been doing over the last few years has had a few requirements for “near-real-time” and many more requirements for off-line, “batch” analysis — and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don’t understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes’ worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your part of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv’s, or whatever).

    Python, R, C++, whatever, only come into play after you’ve mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to “statistically survive” the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework…?), and the supporting design and storage-technology choices.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.