I am trying to adapt a PHP application to handle non-Latin scripts (specifically: Japanese, Simplified Chinese and Arabic). The app’s data validation routines make frequent use of regular expressions to check input, but I am not sure how to adapt the \w character type to other languages without installing additional locales on the system (which I cannot rely on).
Previous developers who worked on the app simply added the needed characters to the regexes as the number of supported languages grew (you frequently see “[\wÀÁÂÃÄÅÆÇÈÉ… etc” in the code), but I can’t really do this for all the alphabets I need to support now.
Does anybody out there have some advice on how to tackle this?
See this comment on php.net: http://www.php.net/manual/en/regexp.reference.unicode.php#102756
for example:
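A minimal sketch of the idea, assuming PHP 7+ with a PCRE build that supports Unicode properties and UTF-8 encoded input (the helper names below are hypothetical, not taken from the linked comment). With the `u` modifier, `\p{L}` matches a letter in any script, and script properties such as `\p{Han}`, `\p{Hiragana}`, `\p{Katakana}` and `\p{Arabic}` restrict a match to one writing system, so no system locale is needed:

```php
<?php
// Hypothetical helper: a Unicode-aware stand-in for a ^\w+$ check.
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
// /u treats pattern and subject as UTF-8; /D stops $ from matching
// before a trailing newline.
function is_word(string $input): bool
{
    return preg_match('/^[\p{L}\p{M}\p{N}_]+$/uD', $input) === 1;
}

// Hypothetical helper: accept only kanji, hiragana and katakana.
function is_japanese(string $input): bool
{
    return preg_match('/^[\p{Han}\p{Hiragana}\p{Katakana}]+$/uD', $input) === 1;
}

// Hypothetical helper: accept only characters of the Arabic script.
function is_arabic(string $input): bool
{
    return preg_match('/^\p{Arabic}+$/uD', $input) === 1;
}

var_dump(is_word('日本語'));        // bool(true)
var_dump(is_word('foo bar'));       // bool(false), the space is not a word character
var_dump(is_japanese('こんにちは')); // bool(true)
var_dump(is_arabic('hello'));       // bool(false)
```

Note that `preg_match` returns `false` when the subject is not valid UTF-8, so the strict `=== 1` comparison rejects malformed input rather than treating it as a match.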
…and so on
demonstration: http://cecb.freephptest.com/