I know this is not a new concept by any stretch in R, and

Question


Asked: May 27, 2026 (2026-05-27T05:03:39+00:00)


I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in Computer Science and am entirely self-taught.

Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:

Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting

Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?

Simply, any tips or pointers that you can provide will be greatly appreciated. Again, I won’t take offense if you describe solutions at a third-grade level either.

Thanks in advance.


1 Answer

Editorial Team · Answer 1 · 2026-05-27T05:03:40+00:00

If you need to operate on the entire 10GB file at once, then I second @Chase’s point about getting a larger, possibly cloud-based computer.

(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh in at a couple of kilobytes. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
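As a rough illustration of that preprocessing step, here is a minimal Python sketch. It assumes the file holds one JSON object per line and that each tweet carries a `text` field and a `user` object with a `screen_name`; check a few lines of your own file before relying on that. The function name `extract_fields` and the output layout are placeholders of my own, not part of any library.

```python
import csv
import json


def extract_fields(in_path, out_path):
    """Stream a line-delimited JSON file and keep only author and text.

    Assumes one tweet object per line with "text" and "user" -> "screen_name"
    keys; lines that are blank, malformed, or lack those keys are skipped.
    Returns the number of rows written.
    """
    written = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["author", "text"])
        for line in src:  # reads one line at a time, never the whole file
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or corrupt records
            if not isinstance(tweet, dict):
                continue
            user = tweet.get("user")
            text = tweet.get("text")
            if not isinstance(user, dict) or text is None:
                continue  # records without a user or text are not tweets
            writer.writerow([user.get("screen_name", ""), text])
            written += 1
    return written
```

Because the file is read line by line, memory use stays small regardless of the input size. The resulting CSV should be a fraction of the original and can then be loaded into R with `read.csv`.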

On the other hand, if your analysis is amenable to segmenting the data (for example, you want to first group the tweets by author, date/time, etc.), you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
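To make the MapReduce model a little more concrete, the following is a toy, single-process Python sketch of the map, shuffle, and reduce phases, applied to counting tweets per author. It is not Hadoop or RHIPE code; the function names and the sample records are invented for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (author, 1) pair per tweet record."""
    for record in records:
        yield record["author"], 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample data standing in for preprocessed tweets.
tweets = [
    {"author": "alice", "text": "first"},
    {"author": "bob", "text": "second"},
    {"author": "alice", "text": "third"},
]

counts = reduce_phase(shuffle_phase(map_phase(tweets)))
print(counts)  # {'alice': 2, 'bob': 1}
```

The point of the split is that each map call looks at one record in isolation and each reduce call looks at one key in isolation, which is what lets the framework spread the work across a cluster.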

A couple of pointers:

An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
