I’m looking for a way to order Google Book’s Ngram’s by frequency.
The original dataset is here: http://books.google.com/ngrams/datasets. Inside each file the ngrams are sorted alphabetically and then chronologically.
My computer is not powerful enough to handle 2.2 TB worth of data, so I think the only way to sort this would be “in the cloud”.
The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.
Is there a financially efficient way to find the 10,000 most frequent 1grams, 2grams, 3grams, 4grams, and 5grams?
To throw a wrench in it, the datasets contain data for multiple years:
As an example, here are the 30,000,000th and 30,000,001st lines from file 0
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The first line tells us that in 1978, the word "circumvallate" (which means
"surround with a rampart or other fortification", in case you were wondering)
occurred 313 times overall, on 215 distinct pages and in 85 distinct books
from our sample.
Ideally, the frequency lists would only contain data from 1980-present (the sum of each year).
Any help would be appreciated!
Cheers,
I would recommend using Pig!
Pig makes things like this very easy and straight-forward. Here’s a sample pig script that does pretty much what you need:
Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace
/foo/inputand/foo/outputwith S3 buckets too.