I’m working on a particular regexp problem in PHP, but I can’t find a solution.
I’m trying to split some text into terms to get simple words, numbers and web addresses.
So I decided to split on every non-alphanumeric character (anything that \w does not match).
To work with different languages, I use \w with additional letters, like Ää éèÈ and so on.
Example:
20,000 15.20 This is at Text. Right?!
www.google.com Jean Béraud
Until now, I have used the following regexp to split the text:
[^\w(äÄüÜöÖßèé)]
This works well in 80% of cases, but it splits 20,000 into 20 and 000, and http://www.google.com into www, google and com.
So I tried to keep the numbers together, but still split on periods, so that Text. becomes Text.
To match 15.20, the following works: (\d+\.\d+). But how do I combine the negation with the other regexp string? The following will not work: (\d+\.\d+)|[^\w(äÄüÜöÖßèé)]?
And: how do I handle the web address?
Something like this?
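A minimal sketch of one way to approach it, under stated assumptions: instead of splitting on separators, match the terms you want to keep with preg_match_all, listing the more specific alternatives (web address, number) before the generic word. The function name tokenize_terms is hypothetical, the web-address alternative is deliberately simple (optional http:// or https://, dot-separated labels, a final label of at least two letters, no path), and Unicode properties are used in place of \w plus a hand-written letter list. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns words, numbers and simple web addresses.
function tokenize_terms(string $string): array
{
    // x: whitespace and "#" comments inside the pattern are ignored.
    // u: pattern and subject are treated as UTF-8.
    $pattern = '~
        (?:https?://)?(?:[\p{L}\p{N}-]+\.)+\p{L}{2,}   # web address, e.g. www.google.com
      | \d+(?:[.,]\d+)*                                # number, e.g. 20,000 or 15.20
      | [\p{L}\p{N}_]+                                 # plain word, e.g. Text
    ~xu';

    // preg_match_all returns false on error (for example invalid UTF-8).
    if (preg_match_all($pattern, $string, $matches) === false) {
        return [];
    }
    return $matches[0];
}

// Sample input from the question.
$string = "20,000 15.20 This is at Text. Right?!\nwww.google.com Jean Béraud";
print_r(tokenize_terms($string));
```

Because the trailing period in Text. is not followed by another label, the web-address alternative fails there and the word alternative picks up Text on its own. Input such as end.Next (no space after the period) would be taken for an address by this sketch, so the address alternative may need tightening for real data.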
Demo, Result:
Q: Why does \w match é in my example?
A: That depends on the locale of the system on which the PCRE library is used. From the PHP Manual:
Alternatively it might be helpful to specify the regex as working with UTF-8:
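A minimal sketch of that idea, assuming the input is (or should be) UTF-8: the u pattern modifier makes PCRE treat both pattern and subject as UTF-8, and \p{L} / \p{N} then match letters and numbers of any script, so no list of accented letters is needed. The function name split_utf8_words is hypothetical, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: split on runs of characters that are not
// Unicode letters, Unicode numbers or the underscore.
function split_utf8_words(string $string): array
{
    // An empty pattern with the u modifier matches any valid UTF-8 string;
    // on invalid UTF-8 preg_match returns false instead of 1.
    if (preg_match('//u', $string) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // PREG_SPLIT_NO_EMPTY drops empty pieces at the start or end.
    return preg_split('/[^\p{L}\p{N}_]+/u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

$string = "Jean Béraud, Ää éèÈ";
print_r(split_utf8_words($string));
```

For the sample string this yields the four terms Jean, Béraud, Ää and éèÈ, independent of the system locale.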
Ensure $string is UTF-8 encoded. As UTF-8 is international, specific locale settings might not need to be taken into account. Give it a try.