In the good old days when I was a web developer (using PHP), I used to run all submitted form data through a regex before commencing any processing. In most cases, I would allow alphanumerics along with a small set of punctuation characters, which would satisfy 99% of people 99% of the time whilst providing a defense against SQL injection and cross-site scripting (yes, I used PDO prepared statements as well).
More recently I’ve had to deal with input in an internationalised context: specifically, the input can be in quite a few different western and eastern European languages as well as Arabic. In these cases, I resorted to removing potentially dangerous characters and letting everything else in. The application had a very small number of users (fewer than 10) and was only deployed on their internal network, so I wasn’t overly concerned about the security of the system, but I wouldn’t be comfortable taking this approach on a publicly accessible website.
In summary, I would like the input to be filtered so that what is left is “plain text”, but I’m not sure how to define the concept of plain text in an internationalised context. Are there any PHP libraries that address this?
Everything is “plain text”. Even “' DROP TABLE users --” is plain text. Even “<script>” is just plain text.
What you’re worried about are “special characters”, i.e. plain text which has special meanings in certain contexts. For that, you need to escape these special characters to “defuse” them in the given context. For HTML, escape them to HTML entities. For SQL, SQL-escape the string (or use prepared statements to avoid this problem in general). For CSV, CSV-escape the values… You get the idea. There are always functions or libraries available which will do this for you, so don’t try to reinvent the wheel here.
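To make the per-context idea concrete, here is a minimal sketch that takes one untrusted value and escapes it three ways using built-in PHP functions. The sample value, the in-memory SQLite database and the `users` table are assumptions made purely for illustration; the sketch also assumes PHP 7.4 or later with the `pdo_sqlite` extension available.

```php
<?php
// One untrusted value, handled differently for each output context.
// (The sample value is made up for this illustration.)
$name = "Zoë <script>alert('x')</script> O'Brien";

// HTML context: turn the characters that are special in HTML into entities.
$html = '<p>' . htmlspecialchars($name, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</p>';

// SQL context: bind the value through a prepared statement, so it is never
// interpreted as SQL. The in-memory SQLite database is an assumption.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');
$stmt = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$stmt->execute([':name' => $name]);
$stored = $pdo->query('SELECT name FROM users')->fetchColumn();

// CSV context: fputcsv() quotes fields and doubles embedded quote characters.
// The empty escape argument (PHP 7.4+) turns off PHP's non-standard
// backslash escaping.
$out = fopen('php://memory', 'w+');
fputcsv($out, [$name, 'a,b "quoted"'], ',', '"', '');
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
```

Note that the original text is never altered: the database stores it byte for byte, and only its representation changes depending on where it is written.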
If you want to sanitize, i.e. remove content, you need to define better what you want to remove. Removing content also always runs the risk of removing legitimate content your users may want to use. So it’s usually the annoying option.
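If you do decide to remove content, one way to “define better” what goes is to name a Unicode category rather than list characters by hand. The following is only a sketch under one possible definition (drop control characters, keep everything else, including letters of any script); the function name `stripControlChars` is hypothetical, and whether this definition suits your application is for you to judge.

```php
<?php
// Hypothetical helper: removes control characters but keeps text in any script.
// Returns null when the input is not valid UTF-8.
function stripControlChars(string $input): ?string
{
    // An empty pattern with the /u modifier only matches valid UTF-8 subjects,
    // so this rejects malformed input instead of guessing at a repair.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    // Remove characters in Unicode category Cc (control), except tab and
    // newline. Carriage returns are removed as well under this definition.
    return preg_replace('/[^\P{Cc}\t\n]/u', '', $input);
}
```

For example, `stripControlChars("مرحبا\x00 Zoë\x07")` drops the NUL and BEL bytes and leaves the Arabic and accented Latin text intact. The result is still “plain text” that needs escaping for whatever context it ends up in.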
For more on this topic, see The Great Escapism (Or: What You Need To Know To Work With Text Within Text).