While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java’s spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.
That’s when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
Why is that?
The implementation of the corresponding .NET equivalent is less discriminating.
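For anyone who wants to check the documented behaviour, here is a minimal sketch that prints how the two standard predicates, Character.isWhitespace and Character.isSpaceChar, classify a few sample characters. The class name NbspCheck and the helper classify are made up for this illustration; only the Character and String calls are real JDK APIs.

```java
final class NbspCheck {
    // Hypothetical helper: reports how both Character predicates classify c.
    static String classify(char c) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
    }

    public static void main(String[] args) {
        // Plain space, tab, and the three non-breaking spaces named in the Javadoc.
        char[] samples = {' ', '\t', '\u00A0', '\u2007', '\u202F'};
        for (char c : samples) {
            System.out.println(classify(c));
        }
        // String.trim() only strips characters up to U+0020, so the
        // non-breaking spaces around "x" survive and the length stays 3.
        System.out.println("\u00A0x\u00A0".trim().length());
    }
}
```

The non-breaking spaces come out as isWhitespace=false but isSpaceChar=true, while the tab is the other way round, so neither predicate alone covers everything one might want to trim.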
Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C. Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.
Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters for which Character.isWhitespace(char) returns true, they can’t, because there almost certainly exists software that relies on the current implementation working exactly the way it does.
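Since the built-in behaviour is unlikely to change, one possible workaround for the original trimming problem is to combine both predicates yourself. The following is only a sketch under that assumption: NbspTrim, isTrimmable and trimUnicode are hypothetical names, not JDK methods, and treating "isWhitespace or isSpaceChar" as trimmable is a design choice rather than an established convention.

```java
final class NbspTrim {
    // Assumption: a code point is trimmable if either predicate accepts it,
    // which covers tabs and line breaks as well as non-breaking spaces.
    static boolean isTrimmable(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Strips trimmable code points from both ends; the interior is untouched.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isTrimmable(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isTrimmable(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String parsed = "\u00A0 hello\u00A0world \u202F";
        // Brackets make the trimmed boundaries visible in the output.
        System.out.println("[" + trimUnicode(parsed) + "]");
    }
}
```

The non-breaking space between "hello" and "world" is preserved, because only the leading and trailing runs are removed. This avoids listing individual characters by hand, which was the original goal.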