Is it possible to use regex to remove small words in a text? For example, I have the following string (text):
anytext = " in the echo chamber from Ontario duo "
I would like to remove all words that are 3 characters or fewer. The result should be:
"echo chamber from Ontario"
Is it possible to do that using a regular expression or any other Python function?
Thanks.
Certainly, it’s not that hard either:
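A minimal sketch of an expression fitting the description that follows. The pattern is inferred from that description, and the name shortword is an assumption made for illustration:

```python
import re

anytext = " in the echo chamber from Ontario duo "

# Optional non-word characters, then a whole word of 1 to 3 characters.
# (shortword is a hypothetical name, not taken from the original post.)
shortword = re.compile(r'\W*\b\w{1,3}\b')

result = shortword.sub('', anytext)
print(repr(result))          # ' echo chamber from Ontario '
print(repr(result.strip()))  # 'echo chamber from Ontario'
```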
The above expression selects any word that is preceded by some non-word characters (essentially whitespace or the start), is between 1 and 3 characters long, and ends on a word boundary.
The \b boundary matches are important here; they ensure that you don’t match just the first or last 3 characters of a word.
The \W* at the start lets you remove both the word and the preceding non-word characters so that the rest of the sentence still matches up. Note that punctuation is included in \W; use \s if you only want to remove preceding whitespace.
For what it’s worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson’s version collapses multiple whitespace characters into one space. Not sure if that matters to you.
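To make the whitespace difference concrete, here is a small sketch. The split-and-join line is an assumption standing in for mgilson’s version, whose exact code is not shown here, and the sample string is made up for the demonstration:

```python
import re

# Hypothetical sample with runs of several spaces between words.
text = "an  echo   chamber in  Ontario"

# Regex approach: removes each short word plus the non-word characters
# directly before it; other whitespace is left untouched.
by_regex = re.sub(r'\W*\b\w{1,3}\b', '', text)

# Split-and-join approach: split() discards every whitespace run,
# and join() puts back exactly one space between the kept words.
by_split = ' '.join(word for word in text.split() if len(word) > 3)

print(repr(by_regex))  # '  echo   chamber  Ontario'
print(repr(by_split))  # 'echo chamber Ontario'
```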
His list comprehension solution is the faster of the two:
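The timings themselves are not reproduced here. As a rough sketch of how the two approaches could be compared with timeit, assuming a compiled pattern like the one described earlier and a list comprehension standing in for mgilson’s version (the function names with_regex and with_split are hypothetical); the figures will vary with machine and Python version:

```python
import re
import timeit

anytext = " in the echo chamber from Ontario duo "
shortword = re.compile(r'\W*\b\w{1,3}\b')

def with_regex(text):
    # Remove short words together with their preceding non-word characters.
    return shortword.sub('', text)

def with_split(text):
    # Keep only words longer than 3 characters, joined by single spaces.
    return ' '.join([word for word in text.split() if len(word) > 3])

# A modest repetition count keeps the run short; raise it for steadier figures.
regex_time = timeit.timeit(lambda: with_regex(anytext), number=100_000)
split_time = timeit.timeit(lambda: with_split(anytext), number=100_000)
print(f"regex: {regex_time:.3f}s  split/join: {split_time:.3f}s")
```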