I was designing a regex to split all the actual words from a given

Question

Editorial Team

Asked: June 12, 2026

I was designing a regex to split all the actual words from a given text:

Input Example:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected Output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with a regex like this:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

After splitting in Python, the result contains None items and empty spaces.

How do I get rid of the None items? And why didn't the spaces match?

Edit:

Splitting on spaces gives items like: ["there."]

Splitting on non-letters gives items like: ["John","s"]

Splitting on non-letters except ' gives items like: ["'Where","you'"]


1 Answer

Editorial Team · Answer 1 · 2026-06-12T09:22:53+00:00

Instead of a regex, you can use string methods:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

BUT in your example you do not want to remove the apostrophe in John's, while you do want to remove it in you!!'. This character-by-character removal cannot make that distinction, so you need a carefully tuned regex.
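As an aside that is not part of the original answer: one possible non-regex workaround is to strip punctuation only from the two ends of each whitespace-separated token. This is a minimal sketch, assuming that `string.punctuation` is an acceptable set of characters to strip; it would also drop a trailing possessive apostrophe such as the one in Moss'.

```python
import string

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Assumption: every ASCII punctuation character (including the apostrophe)
# may be stripped from the ends of a token, but never from its middle.
edge_chars = string.punctuation

# str.strip() only removes characters from both ends, so John's and wasn't
# keep their inner apostrophe.
words = [token.strip(edge_chars) for token in s.split()]

# Drop tokens that consisted only of punctuation.
words = [w for w in words if w]
print(words)
```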

EDIT: a simple regex can probably solve your problem:

(\w[\w']*)

It captures a run that starts with a word character and keeps going for as long as the next character is a word character or an apostrophe. Note that \w also matches digits and underscores, not only letters.

(\w[\w']*\w)

This second regex is for a specific situation. The first regex can capture tokens like you' (with a trailing apostrophe). The second one avoids that and keeps an apostrophe only if it is inside the word, not at the beginning or at the end. The trade-off is that the second regex can no longer capture the trailing apostrophe in Moss' mom. You must decide whether you want to keep the trailing apostrophe in possessive names that end with s.
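The difference between the two patterns can be seen side by side in a small sketch. The sample strings and the variable names `first`, `second`, and `results` are chosen here for illustration only.

```python
import re

first = re.compile(r"\w[\w']*")      # may end with an apostrophe
second = re.compile(r"\w[\w']*\w")   # must start and end with a word character

samples = ["'Where are you'", "Moss' mom"]

# Map each sample to the pair (matches of first, matches of second).
results = {text: (first.findall(text), second.findall(text)) for text in samples}

for text, (a, b) in results.items():
    print(text, a, b)
```

The first pattern keeps the trailing apostrophe in both samples, and the second drops it in both, so neither one can tell a closing quote from a possessive.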

Example:

import re

rgx = re.compile(r"(\w[\w']*\w)")  # raw string, so \w is not treated as a string escape
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

UPDATE 2: I found a bug in my regex! Because it needs at least two word characters, it cannot capture single-character words, such as the A in A'. The fixed regex is:

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
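As for the None items and the spaces in the original split: when the pattern contains capturing groups, `re.split` also returns the text of every group for each separator it finds, and a group that did not take part in that match is returned as None. The spaces did match; they show up in the result because the last group captured them. The sketch below reuses the pattern from the question and compares it with a non-capturing rewrite. The variable names are illustrative, and the rewrite is an assumption about what was intended, not a complete solution.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Pattern from the question: four capturing groups.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

# Same alternatives, but with non-capturing groups (?:...).
non_capturing = r"(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+)|[^a-zA-Z']+"

with_groups = re.split(capturing, s)
without_groups = re.split(non_capturing, s)

# Filter out empty strings that can appear when a separator sits at an edge.
words = [w for w in without_groups if w]

print(with_groups[:6])
print(words)
```

On this input the non-capturing version no longer produces None items, but the last item still keeps its apostrophe (you'), because no alternative matches a lone apostrophe at the very end of the string. That is one more reason to prefer `findall` with the patterns above.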
