In an application that accepts, stores, processes, and displays Unicode text (for the purpose

Asked: May 15, 2026 (2026-05-15T20:49:33+00:00)

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let’s say that it’s a web application), which characters should always be removed from incoming text?

I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:

The range 0x00–0x1F (the C0 control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F–0x9F (DEL and the C1 control characters)

Ranges of characters that can safely be accepted would be even better to know.

There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I’m mainly interested in the basics.

1 Answer

Editorial Team · Answer 1 · 2026-05-15T20:49:33+00:00

See the W3C note "Unicode in XML and Other Markup Languages". It defines a class of characters as ‘discouraged for use in markup’, which I’d definitely filter out for most web sites. It notably includes such characters as:

U+2028–U+2029, which are funky newlines (line separator and paragraph separator) that will confuse JavaScript if you try to use them in a string literal;
U+202A–U+202E, which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes, which could also have scope outside of an element;
the BOM (U+FEFF).
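As a rough illustration of the list above, here is a minimal Python sketch that strips those characters from a string. The function name `strip_discouraged` is made up for this example, and reading "language override control codes" as the tag characters U+E0001 and U+E0020–U+E007F is my assumption, not something the note is quoted as saying here.

```python
import re

# Characters named in the list above. The tag-character range is an
# assumption about what "language override control codes" refers to.
_DISCOURAGED = re.compile(
    "["
    "\u2028\u2029"                     # line separator, paragraph separator
    "\u202A-\u202E"                    # bidi embedding and override controls
    "\uFEFF"                           # byte order mark
    "\U000E0001\U000E0020-\U000E007F"  # language tag characters (assumed)
    "]"
)


def strip_discouraged(text: str) -> str:
    """Return text with the discouraged characters removed."""
    return _DISCOURAGED.sub("", text)
```

One caveat if you adopt something like this: the tag characters are also used in some emoji flag sequences, so removing them will change how those emoji render.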

Additionally, you’d want to filter or replace the characters that are not valid in Unicode at all (the noncharacters, U+FFFF et al.), and, if you are using a language that works in UTF-16 natively (e.g. Java, or the narrow builds of Python before 3.3 that were standard on Windows), any surrogate code units (U+D800–U+DFFF) that do not form valid surrogate pairs.
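A sketch of that check in Python 3, where a `str` is a sequence of code points, so any surrogate that shows up in one is by definition unpaired. The helper names and the choice to substitute U+FFFD rather than delete are assumptions for illustration.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate block U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def replace_invalid(text: str, replacement: str = "\uFFFD") -> str:
    """Replace noncharacters and lone surrogates with the replacement string."""
    return "".join(
        replacement if is_noncharacter(ord(ch)) or is_surrogate(ord(ch)) else ch
        for ch in text
    )
```

In a language whose strings really are UTF-16 code units you would have to look at adjacent units to decide whether a surrogate is paired, which this sketch does not need to do.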

The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

And arguably (especially for a web application), lose CR as well, and turn tabs into spaces.
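For what it's worth, that normalization might look like the Python sketch below. Treating a bare CR as a line break instead of deleting it outright, and the `tab_width` parameter, are my assumptions; pick whatever suits your input.

```python
def normalize_whitespace(text: str, tab_width: int = 4) -> str:
    """Collapse CRLF and bare CR to LF, then turn each tab into spaces."""
    # Handle CRLF first so a Windows line ending becomes one LF, not two.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\t", " " * tab_width)
```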

The range 0x7F-0x9F (more control characters)

Yep, away with those, except in cases where people might really mean them. (Stack Overflow used to allow them, which let people post strings that had been mis-decoded, and that was occasionally useful for diagnosing Unicode problems.) For most sites I think you’d not want them.
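Putting the two control ranges together, a minimal Python sketch could be a single regular expression. It removes the C0 controls other than tab, LF, and CR, plus DEL and the C1 controls; the name `strip_controls` is just for this example.

```python
import re

# C0 controls except tab (0x09), LF (0x0A) and CR (0x0D),
# plus DEL (0x7F) and the C1 controls (0x80-0x9F).
_CONTROLS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")


def strip_controls(text: str) -> str:
    """Return text with unwanted control characters removed."""
    return _CONTROLS.sub("", text)
```

This has to run on decoded text. Applied to raw UTF-8 bytes, the 0x80–0x9F part would eat continuation bytes of perfectly valid characters.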
