Here’s the problem:
split=re.compile('\\W*')
This regular expression works fine for ordinary words, but sometimes I need it to keep words that contain HTML entities, such as käyttäj&auml;, in one piece.
What should I add to the regex to include the & and ; characters?
You probably want to approach the problem in reverse, i.e. match every run of non-whitespace characters instead of splitting on the separators.
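A minimal sketch of that idea, assuming the words are separated by whitespace. The sample string is made up for illustration:

```python
import re

# Hypothetical sample input; the entity-encoded word is for illustration only.
text = "hello k&auml;ytt&auml;j&auml; world"

# \S+ matches each maximal run of non-whitespace characters,
# so '&' and ';' stay inside the word they belong to.
words = re.findall(r"\S+", text)
print(words)  # ['hello', 'k&auml;ytt&auml;j&auml;', 'world']
```

Note that this also keeps punctuation such as commas attached to the words, which may or may not be what you want.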
Or you can keep splitting, and add the extra characters to the set that is not treated as a separator.
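One way this could look, as a sketch: a negated character class that excludes word characters plus & and ;. It uses + instead of the * from the question, because a pattern that can match the empty string makes re.split cut between every character on Python 3.7 and later. The sample string is an assumption:

```python
import re

# Split on runs of characters that are neither word characters nor '&' / ';'.
split = re.compile(r"[^\w&;]+")

# Hypothetical sample input.
text = "foo, k&auml;ytt&auml;j&auml; bar!"

# Drop the empty string that split() leaves when the text ends with a separator.
words = [w for w in split.split(text) if w]
print(words)  # ['foo', 'k&auml;ytt&auml;j&auml;', 'bar']
```

The drawback is that a stray & or ; that is not part of an entity is kept as well.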
In case you want to match HTML entities specifically, you should try a pattern that treats a whole entity as if it were a single word character.
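A possible pattern for this, offered as a sketch and not a full HTML parser: a word is a run of word characters and entities, where an entity is assumed to look like &name; or &#number;. The name word_re and the sample string are made up for illustration:

```python
import re

# A word is one or more of: an entity (&name; / &#228; / &#xE4;) or a word character.
word_re = re.compile(r"(?:&#?\w+;|\w)+")

# Hypothetical sample input with a stray '&' and ';' that are not entities.
text = "foo & bar; k&auml;ytt&auml;j&auml;!"

words = word_re.findall(text)
print(words)  # ['foo', 'bar', 'k&auml;ytt&auml;j&auml;']
```

Unlike simply allowing & and ; everywhere, this ignores those characters when they do not form an entity. If the goal is to work with the real characters, decoding the text first with html.unescape from the standard library and then splitting on non-word characters may be simpler.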