I have a regular expression that I'm using in PHP:
$word_array = preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path), -1, PREG_SPLIT_NO_EMPTY
);
It works great. It takes a chunk of URL parameters like:
/2009/06/pagerank-update.html
and returns an array like:
array(4) {
[0]=>
string(4) "2009"
[1]=>
string(2) "06"
[2]=>
string(8) "pagerank"
[3]=>
string(6) "update"
}
The only thing I need is for it to also not return strings that are less than 3 characters. So the "06" string is garbage and I’m currently using an if statement to weed them out.
The magic of the split. My original assumption was technically not correct (albeit a solution easier to come to). So let’s check your split pattern:
I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:
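A rearrangement along these lines could look like the sketch below. It assumes the same delimiter list as in the question; the variable names `$original` and `$rearranged` are only illustrative. Because none of the word alternatives starts with one of the single characters, moving those characters to the end should not change where the split happens.

```php
<?php
// Pattern from the question: one outer group, single characters listed first.
$original = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';

// Rearranged sketch: no outer group, single characters collected
// in a character class at the end.
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

$path = '/2009/06/pagerank-update.html';

$a = preg_split($original, urldecode($path), -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, urldecode($path), -1, PREG_SPLIT_NO_EMPTY);

// Both patterns split the sample path into the same pieces.
var_dump($a === $b);
```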
So much for sorting things out up front. Let's call this pattern the split pattern, s for short, and define it. You want to match all parts that contain none of the sequences from the split pattern and are at least three characters long.
I could achieve this with the following pattern, including support of the correct split sequences and unicode support.
Or, more compactly:
Result:
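As a rough sketch of this matching approach (not necessarily the exact pattern meant above), the split pattern can be kept in a string `$s` and reused twice: once to skip over delimiter runs and once in a negative lookahead that keeps a word from running into a delimiter. The `(*SKIP)(*F)` verbs and the `s` and `u` modifiers are my choice here. Without the skip, a pattern like this could pick up fragments such as "tml" from "html".

```php
<?php
// Split pattern "s": the delimiter sequences from the question, rearranged.
$s = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]';

// First alternative: consume a run of delimiters, then (*SKIP)(*F) throws that
// match away and resumes the search right after the run.
// Second alternative: three or more characters, none of which starts a delimiter.
$pattern = '/(?:' . $s . ')+(*SKIP)(*F)|(?:(?!' . $s . ').){3,}/su';

$path = '/2009/06/pagerank-update.html';

preg_match_all($pattern, urldecode($path), $matches);
$word_array = $matches[0];

print_r($word_array); // contains "2009", "pagerank", "update"
```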
The same principle can be used with preg_split as well. It's a little bit different:
Usage:
Result:
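One way to sketch the preg_split variant, under the same assumptions as before (the delimiter list from the question, held in a hypothetical `$s`), is to turn the logic around: words of three or more characters are protected with `(*SKIP)(*F)`, and everything else, both delimiter runs and shorter words, is treated as a separator. This is an illustration of the idea, not a reconstruction of the original code.

```php
<?php
// Split pattern "s": the delimiter sequences from the question, rearranged.
$s = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]';

// 1st alternative: a word of 3+ characters is matched and then skipped,
//                  so it can never become part of a separator.
// 2nd alternative: a run of delimiters is a separator.
// 3rd alternative: a word of 1 or 2 characters is a separator as well.
$pattern = '/(?:(?!' . $s . ').){3,}(*SKIP)(*F)'
         . '|(?:' . $s . ')+'
         . '|(?:(?!' . $s . ').){1,2}/su';

$path = '/2009/06/pagerank-update.html';

// PREG_SPLIT_NO_EMPTY drops the empty pieces between adjacent separators.
$word_array = preg_split($pattern, urldecode($path), -1, PREG_SPLIT_NO_EMPTY);

print_r($word_array); // contains "2009", "pagerank", "update"
```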
These routines work as asked for, but they come at a performance cost. The cost is similar to that of the old answer.
Because you are using a split routine, it will split – regardless of the length.
So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:
Result:
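A minimal sketch of such a filter step follows. The filter pattern `/^.{3,}$/su` and the variable `$filtered` are assumptions for illustration. `preg_filter` returns only the subjects that matched and keeps their original keys, so `array_values` is used to re-index the result.

```php
<?php
$path = '/2009/06/pagerank-update.html';

// Split exactly as in the question.
$word_array = preg_split(
    '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
    urldecode($path), -1, PREG_SPLIT_NO_EMPTY
);

// Keep only entries of at least three characters; '$0' replaces each
// matching entry with itself, and non-matching entries are dropped.
$filtered = array_values(preg_filter('/^.{3,}$/su', '$0', $word_array));

print_r($filtered); // contains "2009", "pagerank", "update"
```

The `u` modifier makes the length count code points rather than bytes, which matters once the decoded path contains multi-byte characters.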