How does something like Statistically Improbable Phrases work?
According to Amazon:
Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
For instance, for Joel’s first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules.
One interesting complication is that these are phrases of either 2 or 3 words, which means the phrases can overlap with or contain each other.
It’s a lot like the way Lucene ranks documents for a given search query. It uses a metric called TF-IDF, where TF is term frequency and IDF is inverse document frequency. The former ranks a document higher the more often the query terms appear in that document; the latter ranks a document higher if it contains query terms that appear infrequently across all documents. The specific way IDF is calculated is log(number of documents / number of documents containing the term), that is, the inverse of the frequency with which the term appears across the collection.
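To make that concrete, here is a minimal Python sketch of that calculation. The toy corpus and the whitespace tokenization are assumptions for illustration, and real Lucene scoring adds normalization factors on top of this basic formula:

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Score `term` in `document` against `corpus`, a list of token lists."""
    tf = Counter(document)[term]             # raw term frequency in this document
    df = sum(term in doc for doc in corpus)  # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)   # TF times log(N / df), as above

corpus = [
    "the tax code is long".split(),
    "tax deductions and tax credits".split(),
    "my two dogs are brown".split(),
]
print(tf_idf("tax", corpus[1], corpus))  # frequent here, rare elsewhere: high score
```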
So in your example, those phrases are SIPs relative to Joel’s book because they are rare phrases (appearing in few books) and they appear multiple times in his book.
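Applying the same idea to phrases instead of single terms gives a plausible SIP-style ranking. Amazon hasn’t published its exact scoring formula, so this is only a sketch under the assumption that the improbability score is basically TF-IDF over 2- and 3-grams:

```python
import math
from collections import Counter

def ngrams(words, n):
    """All contiguous n-word phrases in a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def sips(book, corpus, top=5):
    """Rank the 2- and 3-grams of `book` by a TF-IDF-style improbability score."""
    phrases = Counter(ngrams(book, 2) + ngrams(book, 3))
    corpus_phrases = [set(ngrams(doc, 2) + ngrams(doc, 3)) for doc in corpus]
    scores = {}
    for phrase, tf in phrases.items():
        df = sum(phrase in doc for doc in corpus_phrases)
        # Smoothed IDF keeps the score non-negative when a phrase is in every book.
        scores[phrase] = tf * math.log((1 + len(corpus)) / (1 + df))
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

In practice you’d also want to lowercase, strip punctuation, and filter out phrases made entirely of stop words, which this sketch ignores.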
Edit: in response to the question about 2-grams and 3-grams, overlap doesn’t matter. Consider the sentence “my two dogs are brown”. Here, the list of 2-grams is [“my two”, “two dogs”, “dogs are”, “are brown”], and the list of 3-grams is [“my two dogs”, “two dogs are”, “dogs are brown”]. As I mentioned in my comment, with overlap you get N-1 2-grams and N-2 3-grams for a stream of N words. Because 2-grams can only equal other 2-grams and likewise for 3-grams, you can handle each of these cases separately. When processing 2-grams, every “word” will be a 2-gram, etc.
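For concreteness, here is a quick Python sketch of that extraction, assuming simple whitespace tokenization:

```python
def ngrams(words, n):
    """Contiguous n-word phrases; a stream of N words yields N - n + 1 of them."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "my two dogs are brown".split()
print(ngrams(words, 2))  # ['my two', 'two dogs', 'dogs are', 'are brown']
print(ngrams(words, 3))  # ['my two dogs', 'two dogs are', 'dogs are brown']
```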