The Archive Base

Editorial Team
Asked: May 27, 2026

Using Python and the NLTK I have written a regex to find words with

Using Python and the NLTK, I have written a regex to find words that start with a capital letter in a body of text but aren’t at the beginning of a sentence.

Initially I was using it as follows:

[w for w in text if re.findall(r'(?<!\.\s)\b[A-Z][a-z]\b',w)]

the variable text is created using the treebank corpus as follows:

>>> import nltk
>>> from nltk.corpus import treebank
>>> def concat(lists):
...     biglist = []
...     while len(lists) > 0:
...         biglist = biglist + lists[0]
...         lists = lists[1:]
...     return biglist
...
>>> tbsents = concat(treebank.sents()[200:250])
>>> text = nltk.Text(tbsents)

However, this doesn’t seem to work: it still returns words that are at the beginning of sentences.
So I thought I would try using the text.findall() function instead.
I ran the following and it returned all the words with capital letters as required.

>>> text.findall("<[A-Z][a-z]{3,}>")

The problem I have is that I don’t know how to get the first bit of the regex into the <..> format required for the second function. And if I do, will it even work, or am I taking completely the wrong approach?

Thanks.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 11:57 pm

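    I’m not sure what you’re doing with the first list comprehension: you’re using findall on each individual word, not on the text itself, so the lookbehind (?<!\.\s) never sees the token before the word and always succeeds. Note also that [A-Z][a-z]\b only matches a capital followed by exactly one lowercase letter, i.e. two-letter words.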
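
To make the difference concrete, here is a minimal sketch that runs the question's regex both ways. The `tokens` list is a made-up stand-in for the treebank text (an assumption, not corpus data), and joining with spaces is only one possible way to rebuild a string from tokens.

```python
import re

# Hypothetical tokens standing in for the treebank text.
tokens = ["He", "met", "Mr", "Li", ".", "It", "rained", "."]

# The regex from the question, unchanged.
pattern = re.compile(r"(?<!\.\s)\b[A-Z][a-z]\b")

# Token by token: each word is searched in isolation, so there is never
# a ". " in front of it and the negative lookbehind always succeeds.
per_token = [w for w in tokens if pattern.findall(w)]

# On the joined string the lookbehind can see the preceding ". ".
whole_text = pattern.findall(" ".join(tokens))

print(per_token)   # ['He', 'Mr', 'Li', 'It']
print(whole_text)  # ['He', 'Mr', 'Li']
```

Even on the joined string, the very first word ("He") is still returned, because nothing precedes it for the lookbehind to reject.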

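    The simplest way to do what you want with the treebank corpus, since it is already divided into sentences, is:

    import itertools
    from nltk.corpus import treebank

    # Drop the first token of every sentence, then flatten into one list.
    non_starting_words = list(itertools.chain(*[s[1:] for s in treebank.sents()]))
    # Keep only the tokens that begin with an uppercase letter.
    uppercase_words = [w for w in non_starting_words if w[0].isupper()]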
    

    Perhaps this is what you wanted to do with the “concat” function, but that just builds a list of all words; it doesn’t remove the first word of each sentence. If you do want to concatenate a list of lists, a much better way is list(itertools.chain(*lists)), as I did above.
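
As a small sketch of that flattening step without needing the corpus, the snippet below uses a made-up `sentences` list (an assumption for illustration only). It uses `itertools.chain.from_iterable`, the standard-library equivalent of `chain(*lists)` that avoids unpacking the outer list.

```python
import itertools

# Hypothetical pre-tokenized sentences, standing in for treebank.sents().
sentences = [
    ["The", "market", "fell", "."],
    ["Analysts", "blamed", "Congress", "."],
]

# Plain concatenation: what the question's concat() produces.
all_words = list(itertools.chain.from_iterable(sentences))

# Same flattening, but skipping the first token of each sentence.
non_starting_words = list(itertools.chain.from_iterable(s[1:] for s in sentences))

uppercase_words = [w for w in non_starting_words if w[0].isupper()]
print(uppercase_words)  # ['Congress']
```

Unlike the `while` loop in `concat`, this does not rebuild the list with `+` for every sentence.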

    ETA: Given that you have to work with a flat list of tokens, the best solution is not a single regex over the whole text but a loop that tracks sentence boundaries:

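    import re

    # Tokens that end a sentence. A set tests whole tokens, not substrings.
    punctuation_marks = {".", "!", "?"}
    first_word = True  # the very first token starts a sentence
    uppercase_words = []

    for w in text:
        # Keep capitalized tokens unless they open a sentence.
        if not first_word and re.match("[A-Z][a-z]*$", w):
            uppercase_words.append(w)
        # The next token starts a sentence if this one ends it.
        first_word = w in punctuation_marks

    print(uppercase_words)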
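
The same loop can be wrapped in a function and tried on a short token list. This is only a sketch: the function name `capitalized_non_initial`, the `SENTENCE_ENDERS` set, and the `sample` tokens are all hypothetical and not part of NLTK.

```python
import re

# Assumed set of sentence-ending tokens; extend it as the corpus requires.
SENTENCE_ENDERS = {".", "!", "?"}


def capitalized_non_initial(tokens):
    """Return capitalized tokens that do not directly follow a sentence ender."""
    found = []
    at_sentence_start = True  # the first token opens a sentence
    for tok in tokens:
        if not at_sentence_start and re.match(r"[A-Z][a-z]*$", tok):
            found.append(tok)
        at_sentence_start = tok in SENTENCE_ENDERS
    return found


# Hypothetical tokens, not taken from the corpus.
sample = ["The", "board", "met", "in", "Tokyo", ".",
          "Nobody", "told", "Mr.", "Vinken", "!"]
result = capitalized_non_initial(sample)
print(result)  # ['Tokyo', 'Vinken']
```

"Mr." is skipped because the pattern requires the token to consist only of a capital followed by lowercase letters. Tokens such as opening quotes between a period and the next word are not handled here.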
    

© 2021 The Archive Base. All Rights Reserved