I’m trying to learn SCSU http://unicode.org/reports/tr6 but when I try the Java sample code, the output is always larger than the input

Asked: June 5, 2026

I’m trying to learn SCSU
http://unicode.org/reports/tr6
but when I run the Java sample code, the output is always larger than the input.
I tried this example from the report:

Öl fließt

The report gives the input as:

Unicode code points (9 code points):
00D6 006C 0020 0066 006C 0069 0065 00DF 0074

and the output as:

Compressed (9 bytes):
D6 6C 20 66 6C 69 65 DF 74

But what I got is:
Input:

famihug@hvn:/home/famihug/TestRoom/SCSU%xxd german.txt                      [0]
0000000: c396 6c20 666c 6965 c39f 7420 0a         ..l flie..t .

Output:

famihug@hvn:/home/famihug/TestRoom/SCSU%java CompressMain /compress german.txt
Compressed german.txt: 6 chars to german.csu 13 bytes. Ratio: 108%.

famihug@hvn:/home/famihug/TestRoom/SCSU%ls -lt german.*                     [0]
-rw-r--r-- 1 famihug famihug 13 2012-06-09 10:24 german.csu
-rw-r--r-- 1 famihug famihug 13 2012-06-08 01:04 german.txt

famihug@hvn:/home/famihug/TestRoom/SCSU%xxd german.csu                      [0]
0000000: 0fc3 966c 2066 6c69 65c3 9f74 20

~~~~~~~~~~~~~
And this is what I got when I tried the Japanese sample:

famihug@hvn:/home/famihug/TestRoom/SCSU%wc -m jav.txt                       [0]
117 jav.txt
famihug@hvn:/home/famihug/TestRoom/SCSU%ls -lt jav.*                        [0]
-rw-r--r-- 1 famihug famihug 349 2012-06-08 01:13 jav.txt
-rw-r--r-- 1 famihug famihug 405 2012-06-08 01:01 jav.csu

The report says the output should be: Compressed (178 bytes)

I used gedit/Vim to paste the sample plaintext into the file. What am I doing wrong here?

1 Answer

Editorial Team · Answer 1 · 2026-06-05T10:42:41+00:00

It looks like the sample encoder is expecting UTF-16 input, and you’re giving it UTF-8.

This input: c396 6c20 666c 6965 c39f 7420 0a is Öl fließt in UTF-8, with a trailing space and newline.
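
To check this yourself, here is a small sketch that decodes those 13 bytes as UTF-8 and prints the code points. The class and method names (Utf8CodePoints, codePoints) are made up for this illustration and are not part of the SCSU sample code.

```java
import java.nio.charset.StandardCharsets;

class Utf8CodePoints {
    // The 13 bytes that xxd printed for german.txt in the question.
    static final byte[] GERMAN_TXT = {
        (byte) 0xC3, (byte) 0x96, 0x6C, 0x20, 0x66, 0x6C, 0x69, 0x65,
        (byte) 0xC3, (byte) 0x9F, 0x74, 0x20, 0x0A
    };

    // Decode the bytes as UTF-8 and format each code point as four hex digits.
    static String codePoints(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: 00D6 006C 0020 0066 006C 0069 0065 00DF 0074 0020 000A
        System.out.println(codePoints(GERMAN_TXT));
    }
}
```

The first nine code points match the report's example; the last two (0020 and 000A) are the trailing space and newline your editor left in the file.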

What you’re getting back is 0fc3 966c 2066 6c69 65c3 9f74 20. The first 0f is the SCU tag, which indicates that the rest of the bytes are big-endian UTF-16. The thing is, instead of the UTF-16 equivalents of your input string, the rest of the bytes are just the exact same bytes from the input (minus the newline), and those same bytes represent totally different characters between UTF-8 and UTF-16.

The output you’re getting back seems to represent 쎖氠晬楥쎟琠. Note that this is a 6 character long string, as CompressMain reported. You could run your compressed output back through /expand of the same class to confirm.
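
If you don't want to round-trip through /expand, the following rough sketch reproduces that reading of german.csu by hand: it checks for the leading 0f tag and decodes the remaining 12 bytes as big-endian UTF-16. It is not a general SCSU decoder (it assumes no further tag bytes appear in the payload, which happens to hold for these bytes), and the names ScuPayload and decodeAfterScu are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ScuPayload {
    // SCU (Single Change to Unicode) tag value.
    static final int SCU = 0x0F;

    // The 13 bytes that xxd printed for german.csu in the question.
    static final byte[] GERMAN_CSU = {
        0x0F, (byte) 0xC3, (byte) 0x96, 0x6C, 0x20, 0x66, 0x6C,
        0x69, 0x65, (byte) 0xC3, (byte) 0x9F, 0x74, 0x20
    };

    // Strip the leading SCU tag and read the rest as big-endian UTF-16.
    // Assumption: the payload contains no further SCSU tags.
    static String decodeAfterScu(byte[] csu) {
        if (csu.length == 0 || (csu[0] & 0xFF) != SCU) {
            throw new IllegalArgumentException("data does not start with the SCU tag");
        }
        byte[] payload = Arrays.copyOfRange(csu, 1, csu.length);
        return new String(payload, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String text = decodeAfterScu(GERMAN_CSU);
        System.out.println(text.length() + " chars");
        // Print each UTF-16 code unit so the mismatch with the input is visible.
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The six code units come out as C396, 6C20, 666C, 6965, C39F and 7420, i.e. the UTF-8 input bytes simply paired up two at a time.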

If you encode your input file in UTF-16 rather than UTF-8, you should get the output you’re expecting.
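
One way to do the conversion, sketched in Java so you can stay in one toolchain. I'm assuming here that big-endian UTF-16 without a byte order mark is acceptable to the sample code; I haven't checked how CompressMain reads its input, so you may need to adjust the charset. Utf8ToUtf16 and transcode are hypothetical names, not part of the sample.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8ToUtf16 {
    // Re-encode UTF-8 bytes as big-endian UTF-16 (no byte order mark).
    // Assumption: this is the byte order the compressor wants.
    static byte[] transcode(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Utf8ToUtf16 <utf8-input> <utf16-output>");
            System.exit(1);
        }
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), transcode(input));
    }
}
```

You would run it with the UTF-8 file as the first argument and a new file name as the second, then pass the new file to /compress. Removing the trailing space and newline from the text first should make the result easier to compare with the report's example.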
