I am trying to find collocations with NLTK in a text by using the

Asked: May 29, 2026

I am trying to find collocations with NLTK in a text by using the built-in method.

I have the following example text (test and foo follow each other, but there is a sentence border between them):

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

The result of tokenization and collocations() is as follows:

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

nltk.Text(nltk.word_tokenize(content_part)).collocations()  # prints its result directly
# test. foo

How can I prevent NLTK from:

1. including the dot in my tokens, and
2. finding collocations across sentence borders?

So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.

I guess I need the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with NLTK (collocations() seems more powerful than counting things myself).

1 Answer

Editorial Team · Answer 1 · 2026-05-29T09:05:17+00:00

You could use WordPunctTokenizer to split punctuation off from the words, then use apply_word_filter() to drop every bigram that contains a punctuation token.

The same filter works for trigrams, so that no collocation is found across a sentence border.

from nltk import FreqDist, WordPunctTokenizer, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# Split punctuation into separate tokens: 'test.' becomes 'test', '.'
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Drop every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))

# All remaining bigrams with their scores (not printed below)
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2), reverse=True))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('foo', '4'), ('4', 'test')]
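
For the second part of the question (keeping collocations inside sentence borders), one possible approach is to hand the finder one token list per sentence via from_documents(), which pads each list so that no n-gram reaches into the next sentence. The following is only a sketch under stated assumptions: the sample text is made up, and split_into_sentences is a hypothetical helper that naively cuts at '.', '!' and '?'. Real text would more likely need a Punkt-based segmenter such as nltk.sent_tokenize, which requires the Punkt model data to be installed.

```python
from nltk import WordPunctTokenizer
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# Made-up sample text: "test" and "foo" are adjacent only across a sentence border.
text = "We sell good red wine. test. foo buys good red wine. test. foo"


def split_into_sentences(tokens, terminators=(".", "!", "?")):
    """Hypothetical helper: naively split a token list at sentence-final punctuation."""
    sentences, current = [], []
    for token in tokens:
        if token in terminators:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


tokens = WordPunctTokenizer().tokenize(text)
sentences = split_into_sentences(tokens)

# from_documents() pads every sentence, so no n-gram spans two sentences.
bigram_finder = BigramCollocationFinder.from_documents(sentences)
trigram_finder = TrigramCollocationFinder.from_documents(sentences)

# Any remaining punctuation tokens (for example ',') are filtered out of the n-grams.
for finder in (bigram_finder, trigram_finder):
    finder.apply_word_filter(lambda w: not w.isalnum())

print(bigram_finder.ngram_fd[("test", "foo")])  # 0: the pair only occurs across a border
print(bigram_finder.nbest(BigramAssocMeasures.raw_freq, 2))
print(trigram_finder.nbest(TrigramAssocMeasures.raw_freq, 1))
```

The finders only need an iterable of token lists, so the naive splitter can be swapped for any sentence segmenter that yields one token list per sentence.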
