Quick question about something that is confusing me. I have NLTK installed and it has been working fine. However, I am trying to get the bigrams of a corpus and basically want to call bigrams(corpus), but I am told that bigrams is not defined when I use "from nltk import bigrams".
The same happens with trigrams. Am I missing something? Also, how could I get bigrams from the corpus manually?
I am also looking to calculate the frequencies of bigrams, trigrams, and 4-grams, but am unsure exactly how to go about this.
I have the corpus tokenized with "<s>" and "</s>" at the beginning and end of each sentence. My program so far:
#!/usr/bin/env python
import re
import nltk
import nltk.corpus as corpus
import tokenize
from nltk.corpus import brown

def alter_list(row):
    # Replace a trailing '.' with the end marker, or append the marker.
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    # Prepend the start-of-sentence marker.
    return ['<s>'] + row

news = corpus.brown.sents(categories = 'editorial')
print(len(news), '\n')
x = len(news)
for row in news[:x]:
    print(alter_list(row))
I tested this in a virtualenv and it works:
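A minimal sketch of that kind of check, assuming a recent NLTK 3 release in which `bigrams` and `trigrams` are importable from the top-level package and return generators (hence the `list()` calls). The helper name `nltk_ngram_check` is made up for illustration; it returns the import error text instead of raising, so you can see exactly what your environment reports:

```python
def nltk_ngram_check(tokens):
    """Try NLTK's n-gram helpers on tokens; report an import failure as text."""
    try:
        # Assumption: NLTK 3 exposes both names at package top level.
        from nltk import bigrams, trigrams
    except ImportError as exc:
        return "import failed: {}".format(exc)
    # Both helpers are lazy in NLTK 3, so materialise them for display.
    return list(bigrams(tokens)), list(trigrams(tokens))


print(nltk_ngram_check(["<s>", "this", "is", "a", "test", "</s>"]))
```

If this prints an import failure, it is worth checking which NLTK version and which Python environment the script is actually running under.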
Is that the only error you’re getting?
By the way, as for your second question:
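One way to do it by hand, as a sketch: zip the sequence against shifted copies of itself. `manual_ngrams` is a name I made up here, and it assumes the input supports slicing (a string or a list). On a plain string it gives character n-grams:

```python
def manual_ngrams(seq, n):
    """Return every run of n consecutive items in seq as a tuple."""
    # zip stops at the shortest shifted copy, so no partial n-grams appear.
    return list(zip(*(seq[i:] for i in range(n))))


text = "This is a test"
print(manual_ngrams(text, 2)[:3])  # [('T', 'h'), ('h', 'i'), ('i', 's')]
```

The same function gives trigrams or 4-grams by passing `n=3` or `n=4`.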
For words:
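A sketch for word bigrams, assuming a plain whitespace split is good enough as a tokenizer:

```python
sentence = "This is a test"
words = sentence.split()  # ['This', 'is', 'a', 'test']

# Pair each word with its successor.
word_bigrams = list(zip(words, words[1:]))
print(word_bigrams)  # [('This', 'is'), ('is', 'a'), ('a', 'test')]
```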
Splitting into words this way is obviously very superficial, but depending on your application it may suffice. You could also use NLTK's tokenizers (for example nltk.word_tokenize), which are far more sophisticated.
To accomplish your final goal, you can do something like this:
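A minimal sketch using `collections.Counter` from the standard library. The helper `ngram_counts` and the two toy sentences are made up for illustration; with the real data you would pass in the Brown sentences after adding the "<s>" and "</s>" markers. Counting sentence by sentence is an assumption on my part, chosen so that no n-gram spans a sentence boundary:

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams of length n across a list of tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        # Slide a window of width n over each sentence separately.
        counts.update(zip(*(sent[i:] for i in range(n))))
    return counts


sentences = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "cat", "ran", "</s>"],
]

bigram_freq = ngram_counts(sentences, 2)
trigram_freq = ngram_counts(sentences, 3)
quad_freq = ngram_counts(sentences, 4)

# Show only the most frequent entries to keep the output short.
print(bigram_freq.most_common(3))
print(trigram_freq.most_common(3))
print(quad_freq.most_common(3))
```

If you would rather stay within NLTK, `nltk.FreqDist` can be used the same way, since in NLTK 3 it is a `Counter` subclass.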
I trimmed the output because it was unnecessarily large, but you get the idea.