I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume, your string is not UTF-8. With Umlauts/NON-ASCII characters and regex I think, its easier first to encode to UTF-8 and then – after applying the regex with u-modifier (unicode) – if you need the original encoding, decode again (according to local). So you would start with:
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
You want all word-characters:
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
Replace max 2 “word-characters” followed by a boundary or the end:
So your regex could look like this:
And decode to your local, if you like:
Good luck! Robert