I’m trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = 'help, me'
>>> print c.split()
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I’ve tried to parse the string first and then run the split:
>>> separatedPunctuation = ''
>>> for character in c:
...     if character in '.,;!?':
...         outputCharacter = ' %s' % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
This is more or less the way to do it:
The trick is not to think about where to split the string, but about what to include in the tokens.
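One way to sketch that idea is with a regular expression that describes the tokens themselves and lets re.findall collect every match. The function name tokenize and the exact pattern are assumptions made for illustration: a word is taken to be a run of word characters or apostrophes, and the punctuation set is the one from the question ('.,;!?').

```python
import re

# Assumed token definition (not taken from the original post):
#   [\w']+   a run of word characters or apostrophes, i.e. one word
#   [.,;!?]  a single punctuation mark from the question's set
TOKEN_PATTERN = re.compile(r"[\w']+|[.,;!?]")


def tokenize(text):
    """Return words and the listed punctuation marks as separate tokens.

    Hypothetical helper: whitespace and any character outside the two
    classes above match nothing and are therefore skipped.
    """
    return TOKEN_PATTERN.findall(text)
```

Under these assumptions, tokenize('help, me') returns ['help', ',', 'me']. The whole string is scanned once by the regex engine instead of being rebuilt character by character in a Python loop, which is likely where most of the time was going on large files.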
Caveats: