My input consists of user-posted strings. What I want to do is create a

Question

Asked: May 13, 2026 (2026-05-13T15:35:47+00:00)

My input consists of user-posted strings.

What I want to do is create a dictionary with words, and how often they’ve been used.
This means I want to parse a string, remove all garbage, and get a list of words as output.

For example, say the input is
"#@!@LOLOLOL YOU'VE BEEN \***PWN3D*** ! :') !!!1einszwei drei !"

The output I need is the list:

"LOLOLOL"
"YOU'VE"
"BEEN"
"PWN3D"
"einszwei"
"drei"

I’m no hero at regular expressions and have been Googling, but my Google-fu seems to be weak …

How would I go from input to the wanted output?

1 Answer

Editorial Team · Answer 1 · 2026-05-13T15:35:47+00:00

Simple Regex:

\w+

This matches a string of “word” characters. That is almost what you want.
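To see why this is only almost right, here is a minimal Python sketch that runs the simple pattern over the example input from the question. The variable names are my own, and the choice of Python's re module is an assumption, since the question does not name a language.

```python
import re

# Example input from the question (raw string, so the backslash stays literal).
text = r"#@!@LOLOLOL YOU'VE BEEN \***PWN3D*** ! :') !!!1einszwei drei !"

# \w+ grabs every maximal run of word characters.
simple_matches = re.findall(r"\w+", text)
print(simple_matches)
# ['LOLOLOL', 'YOU', 'VE', 'BEEN', 'PWN3D', '1einszwei', 'drei']
```

The two problems are visible in the result: the apostrophe splits YOU'VE into two matches, and the stray digit stays glued to einszwei.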

This is slightly more accurate:

\w(?<!\d)[\w'-]*

It matches a word character that is not a digit, followed by any number of word characters, apostrophes, or hyphens.

Here are my matches:

1 LOLOLOL
2 YOU'VE
3 BEEN
4 PWN3D
5 einszwei
6 drei

Now, that’s more like it.
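Since the goal in the question is a dictionary of words and how often they are used, here is a rough sketch of how this pattern could feed a counter. It assumes Python; the helper names extract_words and count_words and the second sample post are made up for illustration.

```python
import re
from collections import Counter

# A non-digit word character, then any mix of word characters,
# apostrophes, and hyphens.
WORD_RE = re.compile(r"\w(?<!\d)[\w'-]*")

def extract_words(text):
    """Return the words found in one user-posted string."""
    return WORD_RE.findall(text)

def count_words(posts):
    """Build a word -> frequency mapping over an iterable of strings."""
    counts = Counter()
    for post in posts:
        counts.update(extract_words(post))
    return counts

text = r"#@!@LOLOLOL YOU'VE BEEN \***PWN3D*** ! :') !!!1einszwei drei !"
words = extract_words(text)
print(words)
# ['LOLOLOL', "YOU'VE", 'BEEN', 'PWN3D', 'einszwei', 'drei']

# Hypothetical second post, only to show counts accumulating.
counts = count_words([text, "drei BEEN drei"])
print(counts["drei"], counts["BEEN"], counts["PWN3D"])
# 3 2 1
```

If "Drei" and "drei" should be counted as the same word, lowercasing each post before extracting is probably the simplest option.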

EDIT:
The reason for the negative lookbehind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few “word” characters that are desirable. Allowing \w and disallowing \d includes all Unicode characters that would conceivably start a word in any block of text.
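A small sketch of the difference, assuming Python 3, where \w is Unicode-aware by default for str patterns. The sample sentence is invented, and its accented letters are written as escapes so they are unambiguous.

```python
import re

# "Ärger über naïve café", with the accented letters written as escapes.
sample = "\u00c4rger \u00fcber na\u00efve caf\u00e9"

# ASCII-only class: accented letters are dropped or split words apart.
ascii_only = re.findall(r"[a-zA-Z]+", sample)

# \w minus digits: accented letters count as word characters.
unicode_aware = re.findall(r"\w(?<!\d)[\w'-]*", sample)

print(ascii_only)     # ['rger', 'ber', 'na', 've', 'caf']
print(unicode_aware)  # the four words, intact
```

Other regex flavors may need a flag or option to make \w Unicode-aware, so check your engine before relying on this.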

EDIT 2:
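I have found a more concise way to get the effect of the negative lookbehind: a double-negative character class with one extra exclusion. [^\W\d] matches anything that is neither a non-word character nor a digit, that is, a word character that is not a digit.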

[^\W\d][\w'-]*(?<=\w)

This is the same as the above with the exception that it also ensures that the word ends with a word character. And, finally, there is:
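To illustrate the effect of the trailing lookbehind, this sketch compares the earlier pattern with this one on a made-up sample containing trailing apostrophes and hyphens. It assumes Python, and the names are my own.

```python
import re

sample = "users' well-known- 'quoted'"

# Earlier pattern: trailing apostrophes and hyphens are kept.
loose = re.findall(r"\w(?<!\d)[\w'-]*", sample)

# With (?<=\w) the engine backtracks until the match ends on a word character.
strict = re.findall(r"[^\W\d][\w'-]*(?<=\w)", sample)

print(loose)   # ["users'", 'well-known-', "quoted'"]
print(strict)  # ['users', 'well-known', 'quoted']
```

Note that this also strips a possessive apostrophe such as the one in "users'", which may or may not be what you want for a word count.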

[^\W\d](\w|[-']{1,2}(?=\w))*

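This ensures that there are no more than two hyphens or apostrophes in a row inside a word. That is, it matches “word-up” and “word--up” as single words, but splits “word---up” into “word” and “up”. If you want it to match “word---up” as well, change the 2 to a 3; if you want to allow only one hyphen or apostrophe at a time, drop the {1,2} quantifier.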
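Here is a quick way to check how this last pattern treats runs of hyphens, sketched in Python. The helper find_words is hypothetical; it uses finditer because re.findall would return only the contents of the capturing group instead of the whole match.

```python
import re

def find_words(pattern, text):
    """Return whole matches, even when the pattern has a capturing group."""
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "word-up word--up word---up"

up_to_two = find_words(r"[^\W\d](\w|[-']{1,2}(?=\w))*", sample)
up_to_three = find_words(r"[^\W\d](\w|[-']{1,3}(?=\w))*", sample)
only_one = find_words(r"[^\W\d](\w|[-'](?=\w))*", sample)

print(up_to_two)    # ['word-up', 'word--up', 'word', 'up']
print(up_to_three)  # ['word-up', 'word--up', 'word---up']
print(only_one)     # ['word-up', 'word', 'up', 'word', 'up']
```

The number in the quantifier is the longest run of hyphens or apostrophes that may sit inside a single word; anything longer splits the word in two.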
