I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to length 50, but is it possible for every length-53 string? We know that any given string leaves at least 14 characters of the possible set unused, since the set has 67 characters and the string has only 53 positions. Can we use this information at all?
Thanks for reading.
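If, as you stated, your output strings have to use the same set of characters as the input string, and if you don’t know of any special constraints on the input string, then no, it’s not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle: with 67 possible characters there are 67^53 distinct inputs but only 67^50 distinct outputs, so at least two inputs would have to share an output. Knowing that at least 14 characters go unused in each string doesn’t help, because that is already true of every one of those 67^53 inputs.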
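To make the counting concrete, here is a small sketch in Python (my choice of language, the question doesn't name one). It just compares the number of possible inputs with the number of possible outputs, taking the alphabet size of 67 from the character class in the question.

```python
# 26 uppercase + 26 lowercase + 10 digits + 5 punctuation symbols (. - ~ _ +)
ALPHABET_SIZE = 67

inputs = ALPHABET_SIZE ** 53   # distinct 53-character strings
outputs = ALPHABET_SIZE ** 50  # distinct 50-character strings

# Pigeonhole: a lossless mapping needs at least as many outputs as inputs.
print(inputs > outputs)    # True, so no lossless mapping exists
print(inputs // outputs)   # 300763, i.e. 67**3 inputs per available output
```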
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input; 87 is the smallest size for which 87^50 is at least 67^53). Or you could identify redundancy in the input: perhaps the first character can only be a ‘3’ or a ‘5’, the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
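As a rough sketch of the larger-alphabet option: the code below finds the smallest output alphabet that works, then repacks a string by treating it as one big number in base 67 and rewriting it in base 87. The 20 extra output symbols and the names `OUT_ALPHABET`, `min_alphabet_size` and `recode` are my own assumptions for illustration; you would substitute whatever characters your output channel actually allows.

```python
import string

# The 67 input symbols from the question's character class.
IN_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + ".-~_+"
# Hypothetical output alphabet: the same 67 symbols plus 20 extra ones (an assumption).
OUT_ALPHABET = IN_ALPHABET + "!$%&'()*,/:;=?@[]^{}"


def min_alphabet_size(in_size, in_len, out_len):
    """Smallest n with n**out_len >= in_size**in_len, using exact integers."""
    n = in_size
    while n ** out_len < in_size ** in_len:
        n += 1
    return n


def recode(text, src, dst, out_len):
    """Read text as a number in base len(src); write it as out_len digits in base len(dst)."""
    value = 0
    for ch in text:
        value = value * len(src) + src.index(ch)
    if value >= len(dst) ** out_len:
        raise ValueError("value does not fit in the requested output length")
    digits = []
    for _ in range(out_len):
        value, d = divmod(value, len(dst))
        digits.append(dst[d])
    return "".join(reversed(digits))  # most significant digit first


print(min_alphabet_size(67, 53, 50))  # 87

worst = "+" * 53  # largest possible value in base 67
packed = recode(worst, IN_ALPHABET, OUT_ALPHABET, 50)
assert len(packed) == 50
assert recode(packed, OUT_ALPHABET, IN_ALPHABET, 53) == worst
```

Fixed-width output keeps leading "zero" symbols, so the round trip is exact for every input, not just this one.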
If you can’t do either of those things, you’ll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
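Here is a minimal sketch of that trade-off, not a full compressor. It builds Huffman code lengths from a made-up frequency model (I'm assuming lowercase letters are 20 times more likely than anything else, purely for illustration) and compares the encoded size of two strings against the number of bits that fit in 50 output characters.

```python
import heapq
import math
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + ".-~_+"


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over weights."""
    # Heap entries: (total weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (w1 + w2, counter, syms1 + syms2))
        counter += 1
    return lengths


# Hypothetical model (an assumption): lowercase letters are 20x more likely.
weights = {ch: (20 if ch.islower() else 1) for ch in ALPHABET}
lengths = huffman_code_lengths(weights)


def encoded_bits(text):
    """Total Huffman-coded size of text in bits."""
    return sum(lengths[ch] for ch in text)


# Whole bits that can be stored in 50 characters of a 67-symbol alphabet.
budget_bits = math.floor(50 * math.log2(len(ALPHABET)))

typical = "e" * 53  # a string the model considers likely
unusual = "~" * 53  # a string the model considers unlikely
print(budget_bits, encoded_bits(typical), encoded_bits(unusual))
```

Under this model the likely string fits within the 50-character budget and the unlikely one overshoots it, which is exactly the behaviour you would have to accept.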