I’m trying to match only letters and spaces (simple words) in other languages, and to throw a detection exception if I find numbers or punctuation. When I test the regex I’ve written with UTF-8 numeric characters I found on Wikipedia, the results always come back as a match, and I’m baffled as to why, unless it considers all numbers to be letters.
Here are the characters I’ve tried:
5 or 伍
http://en.wikipedia.org/wiki/Chinese_numerals
5 or Є
http://en.wikipedia.org/wiki/Cyrillic_script
Here’s the code:
$were_bad_characters_found = preg_match('/[^\p{L}\p{Zs}]+/us', $data);
The answer to the question it asks is always no: no bad characters were found.
Based on the docs, it seemed that this would work, and it does work when I run plain English digits through it, but as soon as multilingual characters are involved, it just rolls over on me. I have a number of variations on this for detecting different common scenarios, and all of the UTF-8 regex code only seems to work well for English characters. Thoughts?
The characters you showed are letters.
U+4F0D 伍 is not a digit: its general category is Lo (Other Letter), and it has non-numeric interpretations as well.
U+0404 Є is not a digit either: it is an uppercase letter (Lu), and Unicode assigns it no numeric value.
English digits have the Unicode general category Nd (decimal digit), which is a number category, not a letter category.
In PHP you can use \p{Nd} to match digits. But your regex is working fine.
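To see this concretely, here is a minimal sketch that wraps the regex from the question in a helper and runs it against a few samples. The function name has_bad_characters and the sample strings are illustrative assumptions, not part of the original code; the code points are written as escapes so the result does not depend on the file's encoding (this needs PHP 7 or later).

```php
<?php
// Hypothetical helper (the name is illustrative): returns true when $data
// contains anything other than Unicode letters (\p{L}) or space
// separators (\p{Zs}).
function has_bad_characters(string $data): bool
{
    $result = preg_match('/[^\p{L}\p{Zs}]/u', $data);
    if ($result === false) {
        // preg_match() returns false on failure, e.g. invalid UTF-8 with /u.
        throw new InvalidArgumentException('Input could not be checked');
    }
    return $result === 1;
}

// U+4F0D (伍) and U+0404 (Є) are both letters, so nothing is flagged.
var_dump(has_bad_characters("\u{4F0D} \u{0404}")); // bool(false)

// An ASCII digit has category Nd, not a letter category, so it is flagged.
var_dump(has_bad_characters("abc 5"));             // bool(true)

// U+0665 (Arabic-Indic digit five) is also Nd, so it is flagged too.
var_dump(has_bad_characters("abc \u{0665}"));      // bool(true)

// U+2164 (Roman numeral five) is a letter-like number (Nl), still not \p{L}.
var_dump(has_bad_characters("abc \u{2164}"));      // bool(true)
```

So the pattern does handle non-English digits; the two test characters simply are not digits as far as Unicode is concerned. If you specifically want to detect numbers, \p{Nd} covers decimal digits in any script and \p{N} covers the whole number category.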