Using Python and the NLTK I have written a regex to find words with

Question

Asked: May 27, 2026 (2026-05-27T23:57:59+00:00)

Using Python and the NLTK, I have written a regex to find words that start with a capital letter in a body of text but aren’t at the beginning of a sentence.

Initially I was using it as follows:

[w for w in text if re.findall(r'(?<!\.\s)\b[A-Z][a-z]\b',w)]

The variable text is created from the treebank corpus as follows:

>>> import nltk
>>> from nltk.corpus import treebank
>>> def concat(lists):
...     biglist = [ ]
...     while len(lists)>0:
...         biglist = biglist+lists[0]
...         lists=lists[1:]
...     return biglist
...
>>> tbsents = concat(treebank.sents()[200:250])
>>> text = nltk.Text(tbsents)

However, this doesn’t seem to work: it still returns words that are at the beginning of sentences.
So I thought I would try using the text.findall() function instead.
I ran the following and it returned all the words with capital letters as required.

>>> text.findall("<[A-Z][a-z]{3,}>")

The problem I have is that I don’t know how to get the first part of the regex into the <..> format required by the second function. And if I do, will it even work, or am I taking completely the wrong approach?

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:57:59+00:00

I’m not sure what you’re doing with the first list comprehension: you’re using findall on each individual word, not on the text itself.
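To make that concrete, here is a small sketch of why the per-word version can't filter out sentence starts. The token list is made up (it is not the treebank data), and I've used [a-z]+ rather than the single [a-z] from the question so that words longer than two letters match. Because each token is searched on its own, the lookbehind never has a preceding period to look at.

```python
import re

# Hypothetical token list standing in for the treebank tokens.
tokens = ["The", "cat", "saw", "Alice", ".", "Then", "Bob", "left", "."]

# Assumption: [a-z]+ instead of the question's single [a-z].
pattern = re.compile(r"(?<!\.\s)\b[A-Z][a-z]+\b")

# Each token is searched in isolation, so the negative lookbehind
# always succeeds and sentence-initial words are kept.
per_token = [w for w in tokens if pattern.search(w)]

# Searching one joined string lets the lookbehind see the ". " before "Then".
joined = pattern.findall(" ".join(tokens))

print(per_token)
print(joined)
```

Note that even on the joined string the very first word still gets through, since nothing precedes it.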

The simplest way to do what you want with the treebank corpus, since you already have them divided by sentence, is:

import itertools
non_starting_words = list(itertools.chain(*[s[1:] for s in treebank.sents()]))
uppercase_words = [w for w in non_starting_words if w[0].isupper()]

Perhaps this is what you wanted to do with the “concat” function, but that just produces a list of all words; it doesn’t remove the first word of each sentence. If you do want to concatenate a list of lists, a much better way is the list(itertools.chain(*lists)) idiom used above.
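As a rough illustration of the flattening idea on its own, here is a sketch using a made-up list of tokenised sentences in place of treebank.sents(). It uses itertools.chain.from_iterable, which is equivalent to chain(*lists) but doesn't unpack the outer list into arguments.

```python
import itertools

# Hypothetical tokenised sentences standing in for treebank.sents().
sents = [["The", "cat", "saw", "Alice", "."], ["Then", "Bob", "left", "."]]

# Flatten everything, as the hand-written concat() does.
all_words = list(itertools.chain.from_iterable(sents))

# Drop the first token of each sentence, then flatten.
non_starting_words = list(itertools.chain.from_iterable(s[1:] for s in sents))

# Keep only the tokens whose first character is uppercase.
uppercase_words = [w for w in non_starting_words if w[0].isupper()]

print(uppercase_words)
```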

ETA: Given that you have to work with a list of tokens, the best solution is then not to use regexes but rather:

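import re

# Tokens that end a sentence; a tuple so that only exact tokens match.
punctuation_marks = (".", "!", "?")
first_word = True  # the very first token starts a sentence
uppercase_words = []

for w in text:
    # Keep capitalised words unless they directly follow a sentence end.
    if not first_word and re.match("[A-Z][a-z]*$", w):
        uppercase_words.append(w)
    # The next token starts a sentence if this one is a sentence-ending mark.
    first_word = w in punctuation_marks

print(uppercase_words)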
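If it helps to try this without loading the corpus, the same loop can be wrapped in a function and run on a toy token list. The function name, the sentence_enders parameter, and the sample tokens are all made up for this sketch.

```python
import re

def non_initial_capitalised(tokens, sentence_enders=(".", "!", "?")):
    """Return capitalised tokens that do not directly follow a sentence end.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    first_word = True  # the first token is treated as a sentence start
    found = []
    for w in tokens:
        if not first_word and re.match(r"[A-Z][a-z]*$", w):
            found.append(w)
        # The next token is sentence-initial if this one ends a sentence.
        first_word = w in sentence_enders
    return found

# Made-up tokens, not treebank data.
sample = ["The", "cat", "saw", "Alice", ".", "Then", "Bob", "left", "."]
result = non_initial_capitalised(sample)
print(result)
```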
