I was designing a regex to split all the actual words from a given text:
Input Example:
"John's mom went there, but he wasn't there. So she said: 'Where are you'"
Expected Output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I came up with a regex like this:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and empty spaces.
How do I get rid of the None items? And why didn’t the spaces match?
Edit:
Splitting on spaces will give items like: ["there."]
And splitting on non-letters will give items like: ["John","s"]
And splitting on non-letters except ' will give items like: ["'Where","you'"]
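For reference, the three naive attempts above can be reproduced with a short sketch (the variable names are only illustrative):

```python
import re

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1) Split on whitespace: punctuation stays glued to the words.
by_spaces = text.split()

# 2) Split on runs of non-letters: apostrophes break words apart.
by_non_letters = re.split(r"[^a-zA-Z]+", text)

# 3) Split on runs of non-letters except ': quote marks stay attached.
by_non_letters_keep_apostrophe = re.split(r"[^a-zA-Z']+", text)

print(by_spaces)
print(by_non_letters)
print(by_non_letters_keep_apostrophe)
```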
Instead of regex, you can use string-functions:
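The original snippet is not shown here, so the following is only an assumed illustration of a string-function approach: delete every punctuation character with `str.translate`, then split on whitespace.

```python
import string

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Assumed illustration, not the original snippet:
# remove all ASCII punctuation, then split on whitespace.
remove_punctuation = str.maketrans("", "", string.punctuation)
words = text.translate(remove_punctuation).split()

# The apostrophes inside "John's" and "wasn't" are deleted as well.
print(words)
```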
BUT, in your example you do not want to remove the apostrophe in "John's", yet you do want to remove it in "you!!'". So string operations fail at that point and you need a finely adjusted regex.
EDIT: Probably a simple regex can solve your problem:
It captures a run of characters that starts with a letter and keeps capturing while the next character is an apostrophe or a letter.
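A minimal sketch of a pattern that fits this description, assuming `[a-zA-Z][a-zA-Z']*` (which may differ from the exact regex intended above) and using `re.findall` to collect the words instead of splitting:

```python
import re

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Assumed pattern: one letter, then any mix of letters and apostrophes.
word_pattern = re.compile(r"[a-zA-Z][a-zA-Z']*")

words = word_pattern.findall(text)

# The leading quote of 'Where is skipped, but the trailing one in you' is kept.
print(words)
```

Because `findall` returns only the matched text, the result contains no None items or empty strings.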
This second regex is for a very specific situation. The first regex can capture words like "you'". This one avoids that and only captures an apostrophe if it is within the word (not at the beginning or at the end). But at that point a new problem arises: with the second regex you cannot capture the trailing apostrophe in "Moss' mom". You must decide whether you want to capture a trailing apostrophe in names that end with s and indicate ownership.
Example:
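The trade-off can be sketched with two assumed patterns (both are guesses at the shape of the regexes discussed, not the originals): a loose one that allows a trailing apostrophe, and a strict one that requires a letter at both ends.

```python
import re

text = "Moss' mom said 'you'"

# Assumed loose pattern: a letter, then any mix of letters and apostrophes.
loose = re.findall(r"[a-zA-Z][a-zA-Z']*", text)

# Assumed strict pattern: must start AND end with a letter,
# so an apostrophe only survives inside a word.
strict = re.findall(r"[a-zA-Z][a-zA-Z']*[a-zA-Z]", text)

print(loose)   # keeps the possessive in Moss' but also the stray quote in you'
print(strict)  # drops both trailing apostrophes
```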
UPDATE 2: I found a bug in my regex! It cannot capture single letters followed by an apostrophe, like "A'". The fixed, brand-new regex is here:
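The fixed regex itself is not shown here. As a sketch under that caveat, the bug can be reproduced with an assumed pattern that demands a letter at both ends (so it needs at least two characters), and avoided with an assumed alternative that matches letter runs joined by single interior apostrophes:

```python
import re

text = "A' is a letter, isn't it"

# Assumed buggy shape: needs a letter at both ends, so one-letter words never match.
buggy = re.findall(r"[a-zA-Z][a-zA-Z']*[a-zA-Z]", text)

# Assumed fix: one or more letters, optionally followed by
# apostrophe-plus-letters groups (apostrophes only inside a word).
fixed = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)*", text)

print(buggy)  # "A" and "a" are missing
print(fixed)  # single letters are captured; the trailing apostrophe is dropped
```

Note that this alternative captures "A" without its trailing apostrophe; whether that apostrophe should be kept is the same design decision as for possessives ending in s.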