I am trying to plot Heaps law for a given text (it shows the

Asked: June 1, 2026

I am trying to plot Heaps' law for a given text (it shows the growth of vocabulary size as a function of the length of the text). That is, for each token I need the length of the text and the vocabulary size up to that token.

I have already tokenised my text, but I am stuck because I don’t know how to iterate over all words in the text.

tokens=nltk.wordpunct_tokenize(text)
it=len(tokens)
i=1
for word in tokens:
    print len(tokens), len(set(tokens))
    i=i+1
    if i>it:
        break

I basically need at each iteration for the text to grow by 1 token.
Thanks for your help!

1 Answer

Editorial Team · Answer 1 · 2026-06-01T00:31:13+00:00

tokens is a list that is populated once by NLTK. It doesn’t grow as you iterate over it, so len(tokens) will be the same on every iteration. Since you are already accumulating the count in i, use that instead of len(tokens).

For the unique count, you have the same problem. set(tokens) is always the full set, not the ones you’ve traversed so far. You need to accumulate the set of known words as you go:

i = 0
words = set()  # distinct tokens seen so far
for word in tokens:
    words.add(word)
    i += 1
    # text length so far, vocabulary size so far
    print(i, len(words))

Edit: Silly me, I forgot about enumerate. See Dvir Volk’s answer for how to avoid counting i explicitly.
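For reference, here is a minimal sketch of what the enumerate version might look like. It is not taken from the other answer; the function name heaps_curve is made up for illustration, and the sample text is tokenised with str.split only so the snippet runs without NLTK. It collects the (text length, vocabulary size) pairs in a list instead of printing them, which is probably more convenient if the goal is to plot them.

```python
def heaps_curve(tokens):
    """Return one (text_length, vocabulary_size) pair per token.

    heaps_curve is a hypothetical helper name, not part of NLTK.
    """
    seen = set()   # distinct tokens encountered so far
    curve = []
    # enumerate(..., start=1) supplies the running text length,
    # so no manual counter is needed
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


# Assumption: plain whitespace splitting stands in for
# nltk.wordpunct_tokenize(text) to keep the example self-contained.
sample = "the cat sat on the mat and the dog sat too".split()
for n, v in heaps_curve(sample):
    print(n, v)
```

The two columns can then be passed to whatever plotting library you use, with text length on the x axis and vocabulary size on the y axis.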
