Since Java code can run on any Java VM, I’d like to know how to identify programmatically which Unicode version is supported.
The easiest way, but also the worst way, I can think of to do that would be to pick a code point that’s new to each Unicode release and check its Character properties. Or you could check its General Category with a regex. Here are some selected code points:
Unicode 6.0.0:
Unicode 5.2:
Unicode 5.1:
Unicode 5.0:
I’ve included the general category and the script property, although you can only inspect the script from JDK7 onward, since that is the first Java release that supports it.
I found those code points by running commands like this from the command line:
Where that’s the unichars program. It will only find properties that exist in the version of the Unicode Character Database supported by the Perl you’re running.
I also like my output sorted, so I tend to run
where that’s the ucsort program, which sorts text according to the Unicode Collation Algorithm.
However, in Perl, unlike in Java, this is easy to find out. For example, if you run this from the command line (yes, there’s a programmer API, too), you find:
That shows that Perl version 5.14.0 was the first one to support Unicode 6.0.0. For Java, I believe there is no API that gives you this information directly, so you’ll have to hardcode a table mapping Java versions to Unicode versions, or else use the empirical method of testing code points for properties. By empirically, I mean the equivalent of this sort of thing:
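In Java, the empirical check could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the class name UnicodeProbe, the Assigned hook and the detect method are hypothetical names introduced for illustration, and the probe code points are illustrative picks (one that I believe was first assigned in each release), not the list given earlier. It avoids lambdas and generics so that it should also compile on the older JDKs you would actually want to probe.

```java
public class UnicodeProbe {
    /** Hypothetical hook: answers whether a code point is assigned on some runtime. */
    public interface Assigned {
        boolean test(int codePoint);
    }

    // Version labels, newest first, paired by index with PROBES below.
    private static final String[] VERSIONS = { "6.0", "5.2", "5.1", "5.0" };

    // Assumed probes: one code point believed to be new in each release.
    private static final int[] PROBES = {
        0x20B9, // INDIAN RUPEE SIGN (6.0)
        0x20B8, // TENGE SIGN (5.2)
        0x1E9E, // LATIN CAPITAL LETTER SHARP S (5.1)
        0x1B05  // BALINESE LETTER AKARA (5.0)
    };

    /** Returns the newest listed version whose probe code point is assigned. */
    public static String detect(Assigned runtime) {
        for (int i = 0; i < PROBES.length; i++) {
            if (runtime.test(PROBES[i])) {
                return VERSIONS[i];
            }
        }
        return "older than 5.0";
    }

    public static void main(String[] args) {
        // Ask the running JVM itself via Character.isDefined(int).
        Assigned jvm = new Assigned() {
            public boolean test(int codePoint) {
                return Character.isDefined(codePoint);
            }
        };
        System.out.println("This JVM supports at least Unicode " + detect(jvm));
        // The general category of a probe can be inspected the same way.
        System.out.println(Character.getType(0x1E9E) == Character.UPPERCASE_LETTER);
    }
}
```

The result is only a lower bound: the table stops at 6.0, so any newer runtime also reports 6.0. Passing the check in through the Assigned hook is purely a convenience so the table logic can be exercised without having an old JVM at hand.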
To find out the age of a particular code point, run uniprops -a on it like this:
All my Unicode tools are available in the Unicode::Tussle bundle, including unichars, uninames, uniquote, ucsort, and many more.
Java 1.7 Improvements
JDK7 goes a long way to making a few Unicode things easier. I talk about that a bit at the end of my OSCON Unicode Support Shootout talk. I had thought of putting together a table of which languages support which versions of Unicode in which versions of those languages, but ended up scrapping that to tell people to just get the latest version of each language. For example, I know that Unicode 6.0.0 is supported by Java 1.7, Perl 5.14, and Python 2.7 or 3.2.
JDK7 contains updates for classes Character, String, and Pattern in support of Unicode 6.0.0. This includes support for Unicode script properties, and several enhancements to Pattern to allow it to meet Level 1 support requirements for Unicode UTS#18 Regular Expressions. These include:

- The isupper and islower methods now correctly correspond to the Unicode uppercase and lowercase properties; previously they misapplied only to letters, which isn’t right, because it misses Other_Uppercase and Other_Lowercase code points, respectively. For example, there are lowercase code points which are not GC=Ll (lowercase letters).
- The alphabetic tests are now correct in that they use Other_Alphabetic. They did this wrong prior to 1.7, which is a problem.
- The \x{HHHHH} pattern escape, so you can meet RL1.1; this lets you rewrite [𝒜-𝒵] (which fails due to The UTF‐16 Curse) as [\x{1D49C}-\x{1D4B5}]. JDK7 is the first Java release that fully/correctly supports non-BMP characters in this regard. Amazing but true.
- More properties for RL1.2, of which the script property is by far the most important. This lets you write \p{script=Greek} for example, abbreviated in Java as \p{IsGreek}.
- The new UNICODE_CHARACTER_CLASS pattern compilation flag and corresponding pattern‐embeddable flag "(?U)" to meet RL1.2a on compatibility properties.

I can certainly see why you want to make sure you’re running a Java with Unicode 6.0.0 support, since that comes with all those other benefits, too.
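To make the JDK7 items above concrete, here is a small sketch that exercises them. It assumes JDK7 or later; the class name Jdk7UnicodeDemo and its helper methods are hypothetical, and I am assuming the isupper/islower remark corresponds to Character.isUpperCase and Character.isLowerCase. Non-ASCII text is written with escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

public class Jdk7UnicodeDemo {
    // RL1.1: \x{HHHHH} names a code point directly, so a non-BMP range works.
    public static boolean isMathScriptCapital(String s) {
        return Pattern.matches("[\\x{1D49C}-\\x{1D4B5}]", s);
    }

    // RL1.2: the script property, in its long form and its Is-prefixed form.
    public static boolean isGreek(String s) {
        return Pattern.matches("\\p{script=Greek}+", s)
            && Pattern.matches("\\p{IsGreek}+", s);
    }

    // RL1.2a: with the embedded (?U) flag, \w follows Unicode instead of ASCII.
    public static boolean isUnicodeWord(String s) {
        return Pattern.matches("(?U)\\w+", s);
    }

    public static void main(String[] args) {
        // U+1D49C MATHEMATICAL SCRIPT CAPITAL A, built from its code point.
        String scriptA = new String(Character.toChars(0x1D49C));
        System.out.println(isMathScriptCapital(scriptA));

        System.out.println(isGreek("\u03B1\u03B2\u03B3"));       // alpha beta gamma
        System.out.println(isUnicodeWord("\u00E9t\u00E9"));      // "ete" with acute accents
        System.out.println(Pattern.matches("\\w+", "\u00E9t\u00E9")); // default ASCII-only \w

        // U+02B0 MODIFIER LETTER SMALL H: general category Lm, but Other_Lowercase.
        System.out.println(Character.isLowerCase(0x02B0));
        System.out.println(Character.getType(0x02B0) == Character.LOWERCASE_LETTER);

        // U+0345 COMBINING GREEK YPOGEGRAMMENI: a mark (Mn), but Other_Alphabetic.
        System.out.println(Character.isAlphabetic(0x0345));
        System.out.println(Character.isLetter(0x0345));
    }
}
```

The same compatibility behaviour is available without the embedded flag by passing Pattern.UNICODE_CHARACTER_CLASS to Pattern.compile. On a pre-JDK7 runtime this sketch should not even compile, since Character.isAlphabetic was added in 1.7.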