Let’s say I have three sentences:
- hello world
- hello python
- today is tuesday
If I generate the bigrams of each string, I get something like this:
[('hello', 'world')]
[('hello', 'python')]
[('today', 'is'), ('is', 'tuesday')]
Is there a difference between the bigrams of a single sentence and the bigrams of two consecutive sentences? For example, "hello world. hello python" is two consecutive sentences. Would the bigrams for these two consecutive sentences look like my output?
The code that produced it:
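from itertools import tee

def bigrams(iterable):
    # Pair each item with its successor: (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)  # Python 3; on Python 2 use itertools.izip

with open("hello.txt", 'r') as f:
    for line in f:
        words = line.strip().split()
        bi = bigrams(words)
        print(list(bi))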
It depends on what you want. If you define the items of your bigrams to be whole sentences, each bigram is a pair of consecutive sentences: ('hello world', 'hello python') and ('hello python', 'today is tuesday').
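As a rough sketch of the sentence-level case, the same pairing helper from the question can be applied to a list of sentences instead of a list of words. The `sentences` list here is an assumption standing in for the lines of hello.txt, and `zip` is used in place of `izip` on the assumption that this runs on Python 3:

```python
from itertools import tee

def bigrams(iterable):
    """Return an iterator over consecutive pairs of the iterable."""
    a, b = tee(iterable)
    next(b, None)  # shift the second copy forward by one item
    return zip(a, b)

# Assumed input: the three example sentences, one item per sentence.
sentences = ["hello world", "hello python", "today is tuesday"]

# Each item is a whole sentence, so each bigram pairs two neighbouring sentences.
sentence_bigrams = list(bigrams(sentences))
print(sentence_bigrams)
# [('hello world', 'hello python'), ('hello python', 'today is tuesday')]
```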
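If you want bigrams whose items are words, taken over all sentences as one continuous sequence, you also get pairs that span sentence boundaries, such as ('world', 'hello') and ('python', 'today'). Your per-line output does not contain those.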
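A minimal sketch of the word-level case over all sentences, assuming the sentences are simply split on whitespace and joined into one stream of words. The `sentences` list and the variable names are illustrative, not part of the original code:

```python
from itertools import chain, tee

def bigrams(iterable):
    """Return an iterator over consecutive pairs of the iterable."""
    a, b = tee(iterable)
    next(b, None)  # shift the second copy forward by one item
    return zip(a, b)

# Assumed input: the three example sentences, one item per sentence.
sentences = ["hello world", "hello python", "today is tuesday"]

# Flatten all sentences into a single stream of words, ignoring sentence boundaries.
all_words = chain.from_iterable(s.split() for s in sentences)

word_bigrams = list(bigrams(all_words))
print(word_bigrams)
# [('hello', 'world'), ('world', 'hello'), ('hello', 'python'),
#  ('python', 'today'), ('today', 'is'), ('is', 'tuesday')]
```

Compared with processing each line separately, this adds the boundary-crossing pairs ('world', 'hello') and ('python', 'today'). Whether those pairs are wanted depends on the application.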