A blog post – http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html – suggests calling external systems (querying the twitter API,

Question

Asked: May 22, 2026

A blog post – http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html – suggests calling external systems (querying the Twitter API, or crawling web pages) from within a Hadoop cluster.

For the system I’m currently developing, there are both fast and slow (bulk) subsystems. Data is fetched from Twitter’s API, including quick, individual retrievals. This can amount to hundreds of thousands (even millions) of external requests per day. The content of web pages is also retrieved for further processing, with at least the same scale of requests.

Aside from potential side effects on the external source (changing data so it’s different on the next request), what would be the pluses or minuses of using Hadoop in such a way? Is it a valid and useful method of bulk and/or fast retrieval of data?


1 Answer

Editorial Team · Answer 1 · 2026-05-22T20:01:02+00:00

The plus: it’s a super easy way to distribute the work that needs to be done.

The minus: because of the way Hadoop recovers from failures, you need to be very careful about tracking which calls have and haven’t run (which you can definitely do, it’s just something to watch out for). If a reduce fails, for example, the map tasks that feed that partition may have to be rerun as well. Obviously this would most likely be a map-only job with no reducers, but the same concern applies to mappers: what happens if half of the calls run, then the task fails and is rescheduled? The rescheduled attempt starts over from the beginning of its input, so the calls that already went out are made a second time.

You could use some sort of high-throughput system to manage which calls are actually made. Either way, Hadoop can definitely be used appropriately for this.
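To make the rerun concern concrete, here is a minimal sketch in Python of one way to keep a rescheduled task from repeating calls that already completed. It is an illustration under stated assumptions, not a Hadoop API: `run_calls`, `fake_fetch`, and the `ledger` are hypothetical names, and the in-memory dict stands in for a store that lives outside the cluster and is shared by every task attempt, since a retry may land on a different node.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def run_calls(
    keys: Iterable[str],
    fetch: Callable[[str], str],
    ledger: Dict[str, str],
) -> Iterator[Tuple[str, str]]:
    """Yield (key, result), calling fetch only for keys not yet in the ledger.

    `ledger` is a stand-in (assumption) for a shared external store that
    survives task failures; a plain dict is used here only for the demo.
    """
    for key in keys:
        if key not in ledger:
            # Record the result only after the call returns, so a crash
            # during the call leaves the key eligible for a retry.
            ledger[key] = fetch(key)
        yield key, ledger[key]


if __name__ == "__main__":
    calls = []

    def fake_fetch(key: str) -> str:
        # Hypothetical external call; it only records that it was invoked.
        calls.append(key)
        return "result-for-" + key

    shared_ledger: Dict[str, str] = {}
    # First attempt gets through two of the four keys before "failing".
    list(run_calls(["a", "b"], fake_fetch, shared_ledger))
    # The rescheduled attempt sees all four keys again.
    list(run_calls(["a", "b", "c", "d"], fake_fetch, shared_ledger))
    print(calls)  # ['a', 'b', 'c', 'd']
```

Each key reaches the external service once even though the second attempt replays the whole input. In a real job the ledger lookup is itself a network round trip, so this trades duplicate external calls for extra load on whatever store backs the ledger.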
