Using Python and NLTK, I have written a regex to find words in a body of text that start with a capital letter but aren’t at the beginning of a sentence.
Initially I was using it as follows:
[w for w in text if re.findall(r'(?<!\.\s)\b[A-Z][a-z]\b',w)]
The variable text is created from the treebank corpus as follows:
>>> def concat(lists):
...     biglist = []
...     while len(lists) > 0:
...         biglist = biglist + lists[0]
...         lists = lists[1:]
...     return biglist
>>> tbsents = concat(treebank.sents()[200:250])
>>> text = nltk.Text(tbsents)
However, this doesn’t seem to work: it still returns words that are at the beginning of sentences.
So I thought I would try using the text.findall() function instead.
I ran the following and it returned all the words with capital letters as required.
>>> text.findall("<[A-Z][a-z]{3,}>")
The problem is that I don’t know how to get the first part of the regex (the lookbehind) into the <..> format required by text.findall(). Even if I do, will it work, or am I taking completely the wrong approach?
Thanks.
I’m not sure what you’re doing with the first list comprehension: you’re using findall on each individual word, not on the text itself.
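To make that concrete, here is a small sketch of why the lookbehind never fires when the pattern is applied token by token. The token list is made-up sample data, and I have changed `[a-z]` to `[a-z]+` so the pattern matches words longer than two letters; that change is my assumption about what was intended.

```python
import re

# Lookbehind from the question, with [a-z]+ (an assumed fix) instead of [a-z].
pattern = re.compile(r'(?<!\.\s)\b[A-Z][a-z]+\b')

# Made-up sample tokens standing in for the treebank text.
tokens = ["He", "left", ".", "She", "saw", "Paris", "."]

# Applied per token, there is never anything before the word for the
# lookbehind to inspect, so sentence-initial words slip through.
per_token = [w for w in tokens if pattern.findall(w)]

# Applied to the joined text, the lookbehind can see the preceding ". ".
joined = " ".join(tokens)
on_text = pattern.findall(joined)

print(per_token)  # ['He', 'She', 'Paris']
print(on_text)    # ['He', 'Paris']
```

Note that even on the joined text the very first word still matches, because nothing precedes it for the lookbehind to reject.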
The simplest way to do what you want with the treebank corpus, since you already have them divided by sentence, is:
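Something along these lines should do it. This is a minimal sketch: `sents` is hard-coded sample data so it runs without the corpus, and in your case it would be `treebank.sents()[200:250]`. The capital-letter test uses `str.isupper()` rather than your regex, which is my choice here.

```python
import itertools

# Stand-in for treebank.sents()[200:250]: a list of tokenized sentences.
# (Hypothetical sample data so the sketch runs without the corpus.)
sents = [
    ["Mr.", "Vinken", "is", "chairman", "of", "Elsevier", "."],
    ["The", "company", "is", "based", "in", "Amsterdam", "."],
]

# Drop the first token of each sentence, then flatten into one list.
non_initial = list(itertools.chain(*[s[1:] for s in sents]))

# Keep the tokens that start with a capital letter.
capitalized = [w for w in non_initial if w[:1].isupper()]

print(capitalized)  # ['Vinken', 'Elsevier', 'Amsterdam']
```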
Perhaps this is what you wanted to do with the “concat” function, but that just produced a list of all the words; it didn’t remove the first word of each sentence. If you do want to concatenate a list of lists, a much better way is list(itertools.chain(*lists)), as I did above.
ETA: Given that you have to work with a list of tokens, the best solution is then not to use regexes but rather:
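A rough sketch of that idea: walk the flat token list in pairs and keep a capitalized token only if the token before it does not end a sentence. The sample tokens and the `SENTENCE_END` set are assumptions for illustration; the treebank also contains tokens such as quotes and traces that you may want to handle separately.

```python
# Made-up flat token list, standing in for the concatenated treebank tokens.
tokens = ["The", "board", "met", "in", "Paris", ".",
          "Shares", "of", "Elsevier", "rose", "."]

# Tokens assumed to mark the end of a sentence (hypothetical, adjust as needed).
SENTENCE_END = {".", "!", "?"}

# Pair every token with its predecessor; the very first token has no
# predecessor and is skipped, which is fine since it starts a sentence.
capitalized = [
    w for prev, w in zip(tokens, tokens[1:])
    if w[:1].isupper() and prev not in SENTENCE_END
]

print(capitalized)  # ['Paris', 'Elsevier']
```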