The Archive Base

Editorial Team
Asked: May 22, 2026

A blog post – http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html – suggests calling external systems (querying the Twitter API or crawling web pages) from within a Hadoop cluster.

The system I’m currently developing has both fast and slow (bulk) subsystems. Data is fetched from Twitter’s API, including quick, individual retrievals. This can amount to hundreds of thousands (even millions) of external requests per day. The content of web pages is also retrieved for further processing, at a scale at least as large.

Aside from potential side effects on the external source (changing data so that it is different on the next request), what would be the pluses and minuses of using Hadoop in this way? Is it a valid and useful method for bulk and/or fast retrieval of data?

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 8:01 pm

    The plus: it’s a super easy way to distribute the work that needs to be done.

    The minus: because of the way Hadoop recovers from failures, you need to be very careful about managing what is and isn’t run (which you can definitely do; it’s just something to watch out for). If a reduce task fails, for example, the map tasks that feed that partition may also have to be rerun if their output is no longer available. This would most likely be a map-only (no-reducer) job, but the same concern applies to mappers: what happens if half of the calls run, then the task fails and is rescheduled? The calls that already went out are made again unless you keep track of them.

    You could use some sort of high-throughput system to manage which calls are actually made. Either way, Hadoop can definitely be used appropriately for this.
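
To make the rerun concern concrete, here is a minimal Python sketch of one way to track which calls have been made, so that a rescheduled task skips the requests that already succeeded. Everything in it is an assumption for illustration: `CompletedCalls` is a hypothetical in-memory stand-in for whatever shared store you would actually use, and `run_map_task` and `flaky_fetch` are made-up names, not part of Hadoop or of any Twitter client.

```python
from typing import Callable, Dict, Iterable, List, Optional


class CompletedCalls:
    """Hypothetical stand-in for a shared store that outlives a task attempt."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._results.get(key)

    def put(self, key: str, value: str) -> None:
        self._results[key] = value


def run_map_task(keys: Iterable[str],
                 fetch: Callable[[str], str],
                 store: CompletedCalls) -> List[str]:
    """Fetch each key at most once; reuse stored results when the task is rerun."""
    output = []
    for key in keys:
        result = store.get(key)
        if result is None:
            result = fetch(key)     # the external call (API request, page crawl)
            store.put(key, result)  # record it before moving to the next key
        output.append(result)
    return output


# Simulate a task that dies halfway through and is then rescheduled.
calls_made: List[str] = []
fail_on = {"user3"}


def flaky_fetch(key: str) -> str:
    if key in fail_on:
        fail_on.discard(key)        # fail only on the first attempt
        raise RuntimeError("simulated task failure")
    calls_made.append(key)
    return f"data-for-{key}"


store = CompletedCalls()
keys = ["user1", "user2", "user3", "user4"]

try:
    run_map_task(keys, flaky_fetch, store)        # first attempt fails at user3
except RuntimeError:
    pass

results = run_map_task(keys, flaky_fetch, store)  # rescheduled attempt
```

On the second attempt only `user3` and `user4` trigger external calls, because the results for `user1` and `user2` are already in the store. In a real cluster the store would have to live outside the task, and if two attempts of the same task can run at once (as with speculative execution), the check and the write would need to be atomic.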
