I have to generate random sentences using nltk. However, it seems like text.generate() only gives us sentences with trigrams. Is there any way I can expand this to include unigrams as well as bigrams?
My current code is:
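import string
import nltk
from nltk.text import Text

# ln holds one line of the input text (read elsewhere)
# Strip punctuation, then split the line into word tokens
exclude = set(string.punctuation)
ln = ''.join(ch for ch in ln if ch not in exclude)
words = nltk.word_tokenize(ln)
# Build bigram and trigram tuples from the tokens
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
tText = Text(words)
tText1 = Text(my_bigrams)
tText2 = Text(my_trigrams)
print tText.generate()
print tText1.generate()
print tText2.generate()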
Here is my change to the generate() function, which adds a parameter c for the n-gram order:
def generate(self, length=100, c=3):
    """
    Print random text, generated using a trigram language model.
    :param length: The length of text to generate (default=100)
    :type length: int
    :seealso: NgramModel
    """
    if '_trigram_model' not in self.__dict__:
        print "Building ngram index..."
        estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
        self._trigram_model = NgramModel(c, self, estimator=estimator)
    text = self._trigram_model.generate(length)
    print tokenwrap(text)
After reading your code, here are some considerations:
nltk.Text takes a sequence of words (tokens) as its argument, not tuples such as bigrams or trigrams.
nltk.Text.generate() generates text using trigrams exclusively, as its documentation states.
You will need to write the generation function yourself if you want to use unigrams and bigrams. It should be straightforward if you take the source of generate() as a starting point.
You could even attach it dynamically to the nltk.Text class so that it becomes available on all Text objects:
nltk.Text.generate_with_ngrams = my_generation_function
(Don't forget to include self as the first argument of your function.)
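To make the last two points concrete, here is a minimal sketch of such a function. It is not NLTK's implementation: it uses a plain frequency-based n-gram table with no smoothing, and the names generate_with_ngrams, n, seed and the stand-in class TinyText are all made up for illustration. It only assumes that the object it is attached to exposes its words as a list in self.tokens, which I believe nltk.Text does; check that against your NLTK version before relying on it.

```python
import random
from collections import defaultdict


def generate_with_ngrams(self, length=20, n=2, seed=None):
    """Return a list of `length` words drawn from a simple n-gram model.

    n=1 gives unigrams, n=2 bigrams, n=3 trigrams, and so on.
    Assumes self.tokens is a list of word strings.
    """
    rng = random.Random(seed)
    tokens = list(self.tokens)
    if n < 1 or len(tokens) < n:
        raise ValueError("need n >= 1 and at least n tokens")

    # Map every (n-1)-word context to the words observed right after it.
    followers = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        followers[context].append(tokens[i + n - 1])

    # Seed the output with a context that is known to occur in the text.
    start = rng.randrange(len(tokens) - n + 1)
    output = tokens[start:start + n - 1]

    while len(output) < length:
        context = tuple(output[len(output) - (n - 1):])
        candidates = followers.get(context)
        if not candidates:
            # Dead end (context only seen at the very end of the text):
            # fall back to picking any word.
            candidates = tokens
        output.append(rng.choice(candidates))
    return output[:length]


class TinyText(object):
    """Hypothetical stand-in for nltk.Text: it only stores the tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)


# Attach the function to the class; every instance now has the method.
TinyText.generate_with_ngrams = generate_with_ngrams

words = "the cat sat on the mat and the dog sat on the rug".split()
text = TinyText(words)
for n in (1, 2, 3):
    print(" ".join(text.generate_with_ngrams(length=8, n=n, seed=0)))
```

If the assumption about self.tokens holds, the same assignment (nltk.Text.generate_with_ngrams = generate_with_ngrams) should work on the real class. Returning the word list instead of printing it keeps the function easy to test; join and print the result yourself as in the loop above.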