I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:
def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri
Now the fun part: generalizing to n-grams. The main problem with generalizing my approach is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can’t figure out how.
Also, other implementations I’m looking at are taking an entirely different tack (no surprise), e.g. here and here, so I’m starting to wonder if I’m at a dead end.
Before I give up on this approach, I’m curious: 1) is there a one-line or Pythonic way to create a list of arbitrary size in this manner? 2) what are the downsides of approaching the problem this way?
The following function should work for a general n-gram model.
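A minimal sketch of one such function, assuming `text` is a list of tokens as in the question. The name `ngrams` and the parameter `n` are illustrative choices, not taken from any particular library. It builds each length-n window with a slice, which is one way to get the "one line" construction the question asks about:

```python
def ngrams(text, n):
    """Return every contiguous window of n tokens from text, each as a list.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # text[i:i + n] is the window starting at position i.
    # The range stops early enough that the last window is still n tokens long;
    # if text has fewer than n tokens, the range is empty and the result is [].
    return [list(text[i:i + n]) for i in range(len(text) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
```

With n set to 1, 2, or 3, this should produce the same lists as the `unigrams`, `bigrams`, and `trigrams` functions in the question. One downside of the slicing approach is that `text` must support `len` and indexing, so it would not work directly on a generator of tokens.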