I’m trying to split a string on its punctuation, but the string may contain URLs (which conveniently have all the typical punctuation marks).
I have a basic working knowledge of RegEx, but not enough to help me out here. This is what I was using when I discovered the problem:
$text[$i] = preg_split('/[\.\?!\-]+/', $post->text);
(This also accounts for multiple consecutive punctuation characters – ellipses, !!!!, ????, ?!?, etc.)
How would I split a string on the punctuation while maintaining the integrity of URLs? Thanks!
Edit:
My apologies…an example would be something along the lines of a tweet:
"Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."
The results should look something like this:
[0] => "Blah blah blah?"
[1] => "A sentence."
[2] => "Here's a link: http://somelink.com?key=value ."
What you’re doing here isn’t quite splitting on punctuation, because you’re trying to keep the punctuation in one of the split items. You’re also attempting to discard the whitespace afterwards, but don’t seem to have covered that in your question.
I would tackle this in the following way: split your input string with a regular expression which matches punctuation or a URL, and keep the pieces, including the separators. Then iterate over the items, and for each separator decide whether it was punctuation, in which case you can strip trailing whitespace and move it to the end of the previous item, or a URL, in which case you just join it with the preceding and following items.
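A rough sketch of that merging step, assuming the pieces already alternate between plain text (even indexes) and separators (odd indexes), and that a separator counts as a URL if it starts with `http://` or `https://`. The name `mergePieces` and the sample array are my own illustration, not something from your code:

```php
<?php
// Hypothetical helper: merge alternating text/separator pieces into sentences.
// Assumption: even indexes hold plain text, odd indexes hold separators
// (either a run of punctuation or a URL).
function mergePieces(array $pieces): array
{
    $sentences = [];
    $current = '';

    foreach ($pieces as $i => $piece) {
        $isSeparator = ($i % 2 === 1);
        // Assumption: anything starting with http:// or https:// is a URL.
        $isUrl = $isSeparator && preg_match('~^https?://~', $piece) === 1;

        // Every piece is appended to the sentence being built.
        $current .= $piece;

        // Punctuation closes the sentence; a URL just stays inside it.
        if ($isSeparator && !$isUrl) {
            $sentences[] = trim($current);
            $current = '';
        }
    }

    // Keep any trailing text that was not closed by punctuation.
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }

    return $sentences;
}

// Hand-written pieces for the example tweet from the question.
$pieces = ['Blah blah blah', '?', ' A sentence', '.', " Here's a link: ",
           'http://somelink.com?key=value', ' ', '.', ''];
print_r(mergePieces($pieces));
```

Appending everything to a running buffer and flushing it on punctuation has the same effect as moving the punctuation onto the previous item, and `trim()` takes care of the whitespace that follows each sentence.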
In PHP, you can keep the delimiters using something like this:
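For instance, a minimal sketch of such a call, assuming a deliberately simple URL pattern (`https?://\S+` is my own placeholder, not a full URL matcher, and it will swallow punctuation glued to the end of a link):

```php
<?php
$text = "Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value .";

// The whole pattern sits in one capturing group, so each separator
// (a URL or a run of punctuation) is returned between the text pieces.
// The URL alternative comes first so punctuation inside a link is not split on.
$pieces = preg_split(
    '~(https?://\S+|[.?!\-]+)~',
    $text,
    -1,                        // no limit on the number of pieces
    PREG_SPLIT_DELIM_CAPTURE
);

print_r($pieces);
```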
where the PREG_SPLIT_DELIM_CAPTURE flag is explained in the documentation as: