I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I’d like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 encodings, to validate that characters outside the BMP can be transformed from UTF-32, through UTF-16 surrogates, through UTF-8, and back properly.
And I always find it a bit more elegant if the strings in question aren’t just composed of random bytes, but are actually meaningful in the (various) languages they encode.
Although this isn’t quite what you asked for, I’ve always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site also offers this:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
… which collects equivalents of English’s “Quick brown fox” sentence for a variety of languages, each exercising all the characters that language uses. That page refers to a larger list of “pangrams” which used to be on Wikipedia but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
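Whichever strings you end up using, the round trip described in the question can be sanity-checked with something like the Python sketch below. The sample strings and the helper names (`SAMPLES`, `round_trip`, `utf16_units`) are my own placeholders, not taken from the files above; swap in lines from those documents as needed. Note that the pangram files mostly cover BMP characters, so you will probably still want a few supplementary-plane samples like these.

```python
# Hypothetical samples: each contains at least one character outside the BMP.
SAMPLES = [
    "Gothic letters: \U00010330\U00010331\U00010332",
    "CJK Extension B: \U00020000",
    "Musical G clef: \U0001D11E",
    "Emoji: \U0001F600",
]


def round_trip(text):
    """Encode UTF-32 -> UTF-16 -> UTF-8 and back; True if nothing was lost."""
    utf32 = text.encode("utf-32-le")
    # UTF-32 -> UTF-16 (non-BMP characters become surrogate pairs)
    utf16 = utf32.decode("utf-32-le").encode("utf-16-le")
    # UTF-16 -> UTF-8 (non-BMP characters become 4-byte sequences)
    utf8 = utf16.decode("utf-16-le").encode("utf-8")
    # ... and back to UTF-32 for a byte-for-byte comparison
    return utf8.decode("utf-8").encode("utf-32-le") == utf32


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    for sample in SAMPLES:
        print(round_trip(sample), [hex(u) for u in utf16_units(sample)])
```

This only checks Python's own codecs against each other, so treat it as a way to generate expected byte sequences for your cross-platform code rather than as a test of that code itself.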