My input consists of user-posted strings.
What I want to do is create a dictionary with words, and how often they’ve been used.
This means I want to parse a string, remove all garbage, and get a list of words as output.
For example, say the input is
"#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** ! :') !!!1einszwei drei !"
The output I need is the list:
"LOLOLOL", "YOU'VE", "BEEN", "PWN3D", "einszwei", "drei"
I’m no hero at regular expressions and have been Googling, but my Google kung fu seems to be weak …
How would I go from input to the wanted output?
Simple Regex:
\w+
This matches a string of “word” characters. That is almost what you want.
This is slightly more accurate:
\w(?<!\d)[\w'-]*
It matches a word character that is not a digit, followed by any number of word characters, apostrophes, or hyphens, so a match never starts with a digit.
Here are my matches:
Now, that’s more like it.
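To reproduce the comparison, here is a minimal sketch in Python. Python is an assumption, since the question does not name a language, and the variable names are only illustrative. Python's re module supports the fixed-width lookbehind used here and treats \w and \d as Unicode-aware for str patterns; other flavors may behave differently.

```python
import re

# Sample input from the question.
text = "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** ! :') !!!1einszwei drei !"

# Plain runs of word characters: splits at the apostrophe and keeps the
# leading digit of "1einszwei".
simple = re.compile(r"\w+")

# A word character that is not a digit, followed by any mix of word
# characters, apostrophes, and hyphens.
refined = re.compile(r"\w(?<!\d)[\w'-]*")

print(simple.findall(text))
# ['LOLOLOL', 'YOU', 'VE', 'BEEN', 'PWN3D', '1einszwei', 'drei']

print(refined.findall(text))
# ['LOLOLOL', "YOU'VE", 'BEEN', 'PWN3D', 'einszwei', 'drei']
```

The second list is the output the question asks for: "YOU'VE" stays in one piece and the stray 1 in front of "einszwei" is dropped.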
EDIT:
The reason for the negative lookbehind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few “word” characters that are desirable. Allowing \w and disallowing \d includes all Unicode characters that would conceivably start a word in any block of text.
EDIT 2:
I have found a more concise way to get the effect of the negative lookbehind: Double negative character class with a single negative exclusion.
[^\W\d][\w'-]*(?<=\w)
This is the same as the above, except that it also ensures that the word ends with a word character. And, finally, there is:
[^\W\d](\w|[-']{1,2}(?=\w))*
This ensures that there are no more than two non-word characters in a row. That is, it matches “word-up” and “word--up” but not “word---up”, which makes sense. If you also want it to match “word---up”, but not “word----up”, you can change the 2 to a 3.
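To tie this back to the goal of a word-frequency dictionary, here is a rough Python sketch of the last two patterns. Python is assumed, since the question names no language. The helper names words and word_counts are made up for illustration, and lower-casing before counting is a choice the question does not specify.

```python
import re
from collections import Counter

# Starts with a non-digit word character and must end on a word character.
ends_on_word = re.compile(r"[^\W\d][\w'-]*(?<=\w)")

# Same start, but hyphens and apostrophes are only allowed in runs of at
# most two, and only when a word character follows the run.
limited_runs = re.compile(r"[^\W\d](\w|[-']{1,2}(?=\w))*")


def words(pattern, text):
    """Return every full match of pattern in text.

    finditer is used because findall would return only the contents of the
    capture group in limited_runs, not the whole match.
    """
    return [m.group(0) for m in pattern.finditer(text)]


def word_counts(text, pattern=limited_runs):
    """Map each lower-cased word to how often it occurs (assumed policy)."""
    return Counter(w.lower() for w in words(pattern, text))


sample = "#@!@LOLOLOL YOU'VE BEEN ***PWN3D*** ! :') !!!1einszwei drei !"

print(words(ends_on_word, "rock- word---up"))
# ['rock', 'word---up']

print(words(limited_runs, "word-up word--up word---up"))
# ['word-up', 'word--up', 'word', 'up']

print(words(limited_runs, sample))
# ['LOLOLOL', "YOU'VE", 'BEEN', 'PWN3D', 'einszwei', 'drei']

print(word_counts("Drei drei DREI zwei"))
# Counter({'drei': 3, 'zwei': 1})
```

With {1,2}, Python accepts a run of one or two hyphens inside a word and splits the word at a run of three. Removing the quantifier, so the pattern uses [-'] alone, would reject “word--up” as well.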