I am trying to find collocations with NLTK in a text by using the built-in method.
I have the following example text ("test" and "foo" follow each other, but there is a sentence border between them):
content_part = """test. foo 0 test. foo 1 test.
foo 2 test. foo 3 test. foo 4 test. foo 5"""
The results of tokenization and collocations() are as follows:
import nltk
print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']
nltk.Text(nltk.word_tokenize(content_part)).collocations()  # collocations() prints its result itself
# test. foo
How can I prevent NLTK from:
- including the dot in my tokens, and
- finding collocations across sentence borders?
So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.
I guess that I need to use the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with NLTK (collocations() seems to be more powerful than just counting things myself).
You could use WordPunctTokenizer to separate the punctuation from the words, and then filter out the bigrams that contain punctuation with apply_word_filter().
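A minimal sketch of that idea, assuming Python 3 and NLTK 3. It uses the example text from the question. The punctuation test (`not w.isalnum()`) and the minimum frequency of 2 are choices made here for illustration, not something NLTK prescribes.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
foo 2 test. foo 3 test. foo 4 test. foo 5"""

# WordPunctTokenizer splits "test." into "test" and ".", so the dot
# becomes a token of its own between "test" and "foo".
tokens = WordPunctTokenizer().tokenize(content_part)

finder = BigramCollocationFinder.from_words(tokens)

# Drop every bigram in which at least one word is not purely alphanumeric
# (assumption: such tokens are punctuation). This removes ("test", ".")
# and (".", "foo"); ("test", "foo") never forms because "." sits between them.
finder.apply_word_filter(lambda w: not w.isalnum())

# Keep only bigrams seen at least twice (illustrative threshold).
finder.apply_freq_filter(2)

found = finder.nbest(BigramAssocMeasures.raw_freq, 10)
print(tokens)
print(found)
```

With this text every remaining bigram (such as "foo 0" or "0 test") occurs only once, so nothing survives the frequency filter and `found` is empty, which is what the question asks for.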
The same filter can be used for trigrams, so that no collocations are found across sentence borders there either.
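For trigrams, a sketch along the same lines might look like this, again assuming Python 3 and NLTK 3. The sample text is made up for illustration; it contains one trigram that repeats inside sentences and several that would only arise across a sentence border.

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.tokenize import WordPunctTokenizer

# Hypothetical sample text: "red green blue" repeats within sentences.
text = "red green blue. red green blue. green blue red"

tokens = WordPunctTokenizer().tokenize(text)
finder = TrigramCollocationFinder.from_words(tokens)

# Remove every trigram that contains a non-alphanumeric token, i.e. every
# trigram that touches or spans a sentence-ending dot.
finder.apply_word_filter(lambda w: not w.isalnum())

# Keep only trigrams seen at least twice (illustrative threshold).
finder.apply_freq_filter(2)

found = finder.nbest(TrigramAssocMeasures.raw_freq, 10)
print(found)  # [('red', 'green', 'blue')]
```

Trigrams such as ("blue", ".", "red") are dropped by the word filter, and ("green", "blue", "red") occurs only once, so only the trigram that repeats within sentences is reported.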