I’m trying to learn SCSU
http://unicode.org/reports/tr6
but when I try the Java sample code, the output is always larger than the input.
I tried this example:
Öl fließt
They say the input is:
Unicode code points (9 code points):
00D6 006C 0020 0066 006C 0069 0065 00DF 0074
and the output is:
Compressed (9 bytes):
D6 6C 20 66 6C 69 65 DF 74
But what I got is:
Input:
famihug@hvn:/home/famihug/TestRoom/SCSU%xxd german.txt [0]
0000000: c396 6c20 666c 6965 c39f 7420 0a ..l flie..t .
Output:
famihug@hvn:/home/famihug/TestRoom/SCSU%java CompressMain /compress german.txt
Compressed german.txt: 6 chars to german.csu 13 bytes. Ratio: 108%.
famihug@hvn:/home/famihug/TestRoom/SCSU%ls -lt german.* [0]
-rw-r--r-- 1 famihug famihug 13 2012-06-09 10:24 german.csu
-rw-r--r-- 1 famihug famihug 13 2012-06-08 01:04 german.txt
famihug@hvn:/home/famihug/TestRoom/SCSU%xxd german.csu [0]
0000000: 0fc3 966c 2066 6c69 65c3 9f74 20
~~~~~~~~~~~~~
And this is what I got when I tried the Japanese sample:
famihug@hvn:/home/famihug/TestRoom/SCSU%wc -m jav.txt [0]
117 jav.txt
famihug@hvn:/home/famihug/TestRoom/SCSU%ls -lt jav.* [0]
-rw-r--r-- 1 famihug famihug 349 2012-06-08 01:13 jav.txt
-rw-r--r-- 1 famihug famihug 405 2012-06-08 01:01 jav.csu
They say the output should be: Compressed (178 bytes)
I used gedit/Vim to paste the sample plaintext into a file. What am I doing wrong here?
It looks like the sample encoder is expecting UTF-16 input, and you’re giving it UTF-8.
This input:
c396 6c20 666c 6965 c39f 7420 0a
is "Öl fließt" in UTF-8, with a trailing space and newline.
What you’re getting back is
0fc3 966c 2066 6c69 65c3 9f74 20
The first 0f is the SCU tag, which indicates that the rest of the bytes are big-endian UTF-16. The thing is, instead of the UTF-16 equivalents of your input string, the rest of the bytes are just the exact same bytes from the input (minus the newline), and those same bytes represent totally different characters in UTF-8 and in UTF-16.
The output you’re getting back seems to represent
쎖氠晬楥쎟琠
Note that this is a 6-character string, as CompressMain reported. You could run your compressed output back through /expand of the same class to confirm.
If you encode your input file in UTF-16, not UTF-8, you should get the output you’re expecting.
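To make the mismatch concrete, here is a minimal Java sketch, not part of the SCSU sample code. It shows how the UTF-8 bytes of the German example turn into 6 unrelated characters when read as big-endian UTF-16, and how to re-encode a UTF-8 file as UTF-16. The class name Utf8ToUtf16 and its method names are made up for illustration. Writing big-endian UTF-16 without a byte order mark is an assumption about what the sample encoder reads, so check that against its source.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the SCSU sample code.
public class Utf8ToUtf16 {

    // Reinterpret raw bytes as big-endian UTF-16, ignoring a trailing odd byte.
    // This mimics what appears to happen when UTF-8 input is treated as UTF-16.
    static String misreadAsUtf16(byte[] raw) {
        int evenLength = raw.length - (raw.length % 2);
        return new String(raw, 0, evenLength, StandardCharsets.UTF_16BE);
    }

    // Decode the bytes as UTF-8 and re-encode them as UTF-16BE without a BOM.
    // Assumption: the sample encoder expects big-endian input with no BOM.
    static byte[] utf8ToUtf16Be(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Utf8ToUtf16 <utf8-input> <utf16-output>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);

        byte[] utf8 = Files.readAllBytes(in);
        System.out.println("Read as UTF-16, the input would be "
                + misreadAsUtf16(utf8).length() + " code units long");

        byte[] utf16 = utf8ToUtf16Be(utf8);
        Files.write(out, utf16);
        System.out.println("Wrote " + utf16.length + " bytes of UTF-16BE to " + out);
    }
}
```

For example, `java Utf8ToUtf16 german.txt german16.txt` followed by `java CompressMain /compress german16.txt` (german16.txt being an arbitrary name). Since the file also contains a trailing space and newline, the compressed size will probably be slightly more than the 9 bytes quoted in the report unless you remove them first.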