I’m looking for a way to shorten an already short string as much as possible.
The string is a hostname:port combo and could look like “my-domain.se:2121” or “123.211.80.4:2122”.
I know regular compression is pretty much out of the question on strings this short, due to the overhead needed and the lack of repetition, but I have an idea of how to do it.
Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character could fit in 6 bits. This reduces the length by up to 25% compared to 8-bit ASCII. So my suggestion is something along these lines:
- Encode the string to a byte array using some kind of custom encoding.
- Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).
And then reverse the process to get the original string.
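A minimal sketch of the 6-bit idea in Python, assuming lowercase input restricted to the 39 characters listed above. The names `ALPHABET`, `pack6` and `unpack6` are made up for illustration. Codes run from 1 to 39 so that 0 can act as padding; without that, a 3-character and a 4-character string would both fill 3 bytes and could not be told apart. The sketch keeps the result as bytes, since arbitrary bytes are not guaranteed to form valid UTF-8 or ASCII text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-:."  # the 39 allowed characters


def pack6(text: str) -> bytes:
    """Pack each character into 6 bits; code 0 is reserved as padding."""
    bits = 0
    for ch in text:
        # Codes 1..39; index() raises ValueError for unsupported characters.
        bits = (bits << 6) | (ALPHABET.index(ch) + 1)
    nbits = 6 * len(text)
    pad = -nbits % 8  # zero bits needed to reach a byte boundary
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack6(data: bytes) -> str:
    """Reverse pack6, stopping at the first padding code."""
    bits = int.from_bytes(data, "big")
    chars = []
    # Read 6-bit codes from the most significant end downwards.
    for shift in range(8 * len(data) - 6, -1, -6):
        code = (bits >> shift) & 0x3F
        if code == 0:
            break
        chars.append(ALPHABET[code - 1])
    return "".join(chars)
```

With this, both example strings (17 characters each) pack into 13 bytes, and `unpack6(pack6(s))` returns the original string.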
So to my questions:
- Could this work?
- Is there a better way?
- How?
You could encode the string as base 40, which is more compact than base 64. Since 40^12 is less than 2^64, 12 such tokens fit into a 64-bit long, which works out to about 5.3 bits per character instead of 6. The 40th token could be the end-of-string marker to give you the length (as the data will no longer be a whole number of bytes).
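A rough sketch of the base 40 packing in Python, covering the encoding direction only. The names `ALPHABET`, `END`, `PER_WORD` and `encode_base40` are illustrative, and padding the last word with end markers is one possible convention, not the only one. Note that 40^12 is above 2^63, so the word has to be treated as unsigned (in a signed 64-bit type some values would show up as negative).

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-:."  # tokens 0..38
END = 39        # the 40th token: end-of-string marker
PER_WORD = 12   # 40**12 < 2**64, so 12 tokens fit in one unsigned 64-bit word


def encode_base40(text: str) -> list[int]:
    """Encode text as a list of 64-bit words, 12 base-40 tokens per word."""
    tokens = [ALPHABET.index(ch) for ch in text]
    # Fill the last word with END markers so a decoder can find the true length.
    tokens += [END] * (-len(tokens) % PER_WORD)
    words = []
    for i in range(0, len(tokens), PER_WORD):
        word = 0
        for tok in tokens[i:i + PER_WORD]:
            word = word * 40 + tok  # append one base-40 digit
        words.append(word)
    return words
```

Because the output comes in whole 8-byte words, a 17-character string such as "my-domain.se:2121" takes two words (16 bytes); the saving is largest when the length is at or just below a multiple of 12.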
If you use arithmetic encoding, the result could be much smaller, but you would need a table of frequencies for each token (built from a long list of possible examples).
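To get a feel for how much a frequency table could buy, here is a small Python sketch that only estimates the size; it is not an arithmetic coder. It sums the ideal cost of each token, -log2(p), which is the total an arithmetic coder would approach. The sample list and the names `SAMPLES`, `token_costs` and `estimated_bits` are hypothetical; a real table would be built from many real hostnames and would need a non-zero count for every one of the 40 tokens.

```python
import math
from collections import Counter

# Hypothetical sample list; a real table would come from many real examples.
SAMPLES = ["my-domain.se:2121", "123.211.80.4:2122", "example.com:8080"]
END = "\0"  # stand-in for the end-of-string token


def token_costs(samples: list[str]) -> dict[str, float]:
    """Return the ideal cost in bits, -log2(p), of each token seen in samples."""
    counts = Counter(ch for s in samples for ch in s + END)
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}


def estimated_bits(text: str, costs: dict[str, float]) -> float:
    """Sum the per-token costs; raises KeyError for tokens missing from the table."""
    return sum(costs[ch] for ch in text + END)


costs = token_costs(SAMPLES)
flat = math.log2(40) * (len(SAMPLES[0]) + 1)  # about 5.32 bits per token
model = estimated_bits(SAMPLES[0], costs)
print(f"frequency model: {model:.1f} bits, flat base 40: {flat:.1f} bits")
```

Measuring a string that was also used to build the table is optimistic, so treat the printed figure as a best case rather than a forecast.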
I leave decoding as an exercise. 😉