I am trying to plot Heaps' law for a given text (it shows the growth of vocabulary size as a function of the length of the text). That is, for each token I need the length of the text and the vocabulary size up to that token.
I have already tokenised my text, but I am stuck because I don’t know how to iterate over all words in the text.
tokens=nltk.wordpunct_tokenize(text)
it=len(tokens)
i=1
for word in tokens:
    print len(tokens), len(set(tokens))
    i=i+1
    if i>it:
        break
Basically, I need the text to grow by one token at each iteration.
Thanks for your help!
tokens is a list that is populated once by NLTK. It doesn’t grow as you iterate over it, so len(tokens) will be the same on every iteration. Since you are already accumulating the count in i, use that instead of len(tokens).
For the unique count, you have the same problem. set(tokens) is always the full set, not the words you’ve traversed so far. You need to accumulate the set of known words as you go.
Edit: Silly me, I forgot about enumerate. See Dvir Volk’s answer for how to avoid counting i explicitly.
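The accumulation described above could look something like this. It is only a minimal sketch, assuming Python 3 (so print is a function); heaps_points is a made-up helper name, and the whitespace split just stands in for nltk.wordpunct_tokenize(text) so the snippet runs without NLTK.

```python
def heaps_points(tokens):
    """Return a list of (text_length, vocabulary_size) pairs, one per token.

    heaps_points is a hypothetical name used for illustration only.
    """
    seen = set()   # distinct tokens encountered so far
    points = []
    # enumerate(..., start=1) supplies the running text length,
    # so no explicit counter is needed.
    for length, word in enumerate(tokens, start=1):
        seen.add(word)
        points.append((length, len(seen)))
    return points


if __name__ == "__main__":
    # Stand-in for nltk.wordpunct_tokenize(text); an assumption for the demo.
    tokens = "the cat sat on the mat and the dog sat too".split()
    for length, vocab in heaps_points(tokens):
        print(length, vocab)
```

Adding to a set is cheap, so this makes a single pass over the tokens instead of rebuilding set(tokens) on every iteration. The resulting pairs can be fed straight into whatever plotting library you use.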