I’m looking for a way to order Google Books Ngrams by frequency.

Question

Asked: June 11, 2026

I’m looking for a way to order the Google Books Ngrams by frequency.

The original dataset is here: http://books.google.com/ngrams/datasets. Inside each file the ngrams are sorted alphabetically and then chronologically.

My computer is not powerful enough to handle 2.2 TB worth of data, so I think the only way to sort this would be “in the cloud”.

The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.

Is there a cost-effective way to find the 10,000 most frequent 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams?

One complication: the datasets contain a separate row for each year an ngram appears in:

As an example, here are the 30,000,000th and 30,000,001st lines from file 0 
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate   1978   313    215   85 
circumvallate   1979   183    147   77

The first line tells us that in 1978, the word "circumvallate" (which means 
"surround with a rampart or other fortification", in case you were wondering) 
occurred 313 times overall, on 215 distinct pages and in 85 distinct books 
from our sample.

Ideally, the frequency lists would only cover 1980 to the present, with each ngram's yearly counts summed into a single total.

Any help would be appreciated!

Cheers,


1 Answer

Editorial Team · Answer 1 · 2026-06-11T12:25:52+00:00

I would recommend using Pig!

Pig makes things like this very easy and straightforward. Here’s a sample Pig script that does pretty much what you need:

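-- Load the tab-separated rows: ngram, year, match count, page count, book count
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
-- Keep only rows from 1980 onward
filtered = FILTER raw BY year >= 1980;
-- Sum the yearly counts for each ngram
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
-- Sort by total, highest first, and keep the top 10,000
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');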

Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace /foo/input and /foo/output with S3 paths too.
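
Before paying for cluster time, it may be worth checking the filter, group, sum, and top-N logic on a small extract locally. The following is a minimal Python sketch of the same steps, not part of the Pig script above. It assumes whitespace- or tab-separated rows in the five-column layout quoted in the question, and it assumes all rows for a given ngram are adjacent in the input (the question says each file is sorted alphabetically). The helper names `parse` and `top_ngrams` are hypothetical, and the 1980 and 1981 rows in the demo are made up for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def parse(lines, min_year=1980):
    """Yield (ngram, count) for every row dated min_year or later."""
    for line in lines:
        # Split from the right so ngrams that contain spaces stay in one piece.
        fields = line.rsplit(None, 4)
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        ngram, year, count = fields[0].strip(), int(fields[1]), int(fields[2])
        if year >= min_year:
            yield ngram, count


def top_ngrams(lines, n=10000, min_year=1980):
    """Return the n largest (total, ngram) pairs, highest total first.

    Assumes rows for the same ngram are adjacent, so groupby can sum
    them in one streaming pass without holding the whole file in memory.
    """
    totals = (
        (sum(count for _, count in rows), ngram)
        for ngram, rows in groupby(parse(lines, min_year), key=itemgetter(0))
    )
    # nlargest keeps only n candidates in memory at a time.
    return heapq.nlargest(n, totals)


# Demo: the first two rows come from the question; the rest are made up.
sample = [
    "circumvallate\t1978\t313\t215\t85",
    "circumvallate\t1979\t183\t147\t77",
    "circumvallate\t1980\t10\t9\t8",
    "the cat\t1980\t5\t5\t5",
    "the cat\t1981\t7\t6\t6",
]
result = top_ngrams(sample, n=2)
# result == [(12, 'the cat'), (10, 'circumvallate')]
```

To run it on a real extract, pass an open file object instead of the list, for example `top_ngrams(open(path, encoding="utf-8"))`, where `path` points at an unzipped sample. If the same ngram can show up in more than one file, the per-file results would need a final merge step, which this sketch does not cover.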
