I’m using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works ok. But I have concerns about multibyte characters. Should I specifically handle them to prevent undetermined errors, or should this regex reject mb filenames ok?
PHP does not have “native” support for multibyte characters; you need to use the “mbstring” extension (which may or may not be available). Furthermore, it would appear that there is no way to create a “multibyte-character string” as such; rather, one chooses to treat a native string as a multibyte-character string by using the special “mbstring” functions. In other words, a PHP string does not know its own character encoding, so you have to keep track of it manually.
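To make that concrete, here is a minimal sketch of the difference between counting bytes and counting characters. It assumes the string holds UTF-8 data; the variable names are just for illustration, and the mbstring call is guarded because the extension may not be installed.

```php
<?php
// "ß" (U+00DF) written as explicit UTF-8 bytes, so the result does not
// depend on the encoding this source file happens to be saved in.
$s = "\xC3\x9F";

// strlen() counts bytes: the string itself carries no encoding information.
$byteLength = strlen($s); // 2

// mb_strlen() counts characters, but only because we pass the encoding.
// mbstring is optional, so fall back to null when it is missing.
$charLength = function_exists('mb_strlen') ? mb_strlen($s, 'UTF-8') : null;

echo "bytes: $byteLength\n";
echo 'chars: ' . var_export($charLength, true) . "\n";
```

With mbstring loaded this reports 2 bytes and 1 character; the same two bytes would be counted as 2 characters if you passed a single-byte encoding instead.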
You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters to “high” bytes (for instance, ß is encoded as 0xC3 0x9F), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might encode a multibyte character into “special” PHP bytes, such as 0x22, the “double-quote” symbol.
The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 through the u modifier (PCRE_UTF8).
For the mb_ereg functions, as described above, PHP strings don’t know their own encoding, so you first need to tell the “mbstring” library which encoding to use with the mb_regex_encoding function. Note that that function specifies the encoding of the string you’re matching, not the string containing the regular expression itself.
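Applied to the filename check from the question, a rough sketch could look like the following. The helper name is_safe_filename is hypothetical, the type declarations assume PHP 7 or later, and I have taken the liberty of requiring at least one character (+ instead of *) and anchoring with \z. The explicit ASCII class is a deliberate swap for \w, because what \w matches depends on the u modifier (and, without it, potentially on the locale).

```php
<?php
// Hypothetical helper: accept only ASCII letters, digits, "_", "." and "-".
// The pattern works on raw bytes (no u modifier), so every byte of a
// UTF-8 multibyte character falls outside the class and the name is rejected.
function is_safe_filename(string $filename): bool
{
    return preg_match('/^[A-Za-z0-9_.-]+\z/', $filename) === 1;
}

$ascii = 'photo_01.png';
$utf8  = "stra\xC3\x9Fe.png"; // "straße.png" as UTF-8 bytes

var_dump(is_safe_filename($ascii)); // bool(true)
var_dump(is_safe_filename($utf8));  // bool(false)

// With the u modifier the subject is decoded as UTF-8 and \w also matches
// non-ASCII letters, so the pattern from the question accepts this name.
var_dump(preg_match('!^[\w.-]*$!u', $utf8)); // int(1)

// With the u modifier, a subject that is not valid UTF-8 makes preg_match()
// return false (an error), which is different from 0 (no match).
var_dump(preg_match('!^[\w.-]*$!u', "\xFF.png")); // bool(false)
```

So if the goal is to reject multibyte filenames, leaving the u modifier off and spelling out the allowed ASCII characters is probably the more predictable choice; adding u does the opposite and lets letters such as ß through.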