I would like to encode strings of varying length (typically 1-100 characters) to integers in a way that strings which are lexicographically similar (they would be close together in a dictionary) result in integers that are close together, while further ensuring that these integers are reasonably evenly distributed across the range of possible integer values.
I recognize that ensuring an even distribution may require some kind of survey of the possible strings prior to encoding them.
Does anyone have any ideas on how to do this?
Compressed keys might be useful here. The idea is to compare a set of strings and remove every bit position that has the same value in all of them. This produces a set of almost unique keys, each small enough to fit in an integer. See chapter 6 of “FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs”.
The algorithm described there does not always preserve lexicographical order, but it can be augmented to do so.
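To make the idea concrete, here is a rough Python sketch of a simplified, order-preserving variant. It is not the algorithm from the paper: it only drops bit positions that are constant across a surveyed set of strings, then keeps the first `width` varying bits. The names `build_mask` and `compress_key` are made up for this example, and it assumes the strings contain no NUL characters, since shorter strings are padded with zero bytes.

```python
def build_mask(strings, encoding="utf-8"):
    """Return (byte_index, bit_index) pairs for bits that differ in at least one string."""
    data = [s.encode(encoding) for s in strings]
    n = max(len(b) for b in data)
    all_and = bytearray(b"\xff" * n)
    all_or = bytearray(n)
    for b in data:
        padded = b.ljust(n, b"\x00")  # assumption: shorter strings are zero-padded
        for i, byte in enumerate(padded):
            all_and[i] &= byte
            all_or[i] |= byte
    positions = []
    for i in range(n):
        varying = all_and[i] ^ all_or[i]  # bits set in some strings but not in all
        for bit in range(7, -1, -1):      # most significant bit first keeps the order
            if (varying >> bit) & 1:
                positions.append((i, bit))
    return positions


def compress_key(s, positions, width=32, encoding="utf-8"):
    """Pack the first `width` varying bits of `s` into an integer."""
    b = s.encode(encoding)
    key = 0
    for i, bit in positions[:width]:
        byte = b[i] if i < len(b) else 0
        key = (key << 1) | ((byte >> bit) & 1)
    return key


words = ["apple", "apply", "banana", "band", "bandit"]
mask = build_mask(words)
keys = {w: compress_key(w, mask) for w in words}
```

For the surveyed strings the keys come out in the same order as the strings themselves (ties are possible once the varying bits no longer fit into `width`). Strings that were not part of the survey still get a key, but their ordering is not guaranteed.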
Edit:
A more general approach is to split the string's characters into independent parts (if possible), then estimate the probabilities of these parts and apply arithmetic coding.
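As an illustration, here is a minimal sketch of that approach in Python, under a few assumptions of mine: each character is treated as one independent part, the probabilities are estimated from a sample of strings, and an end-of-string symbol that sorts before every character is added so that a prefix sorts before its extensions. `build_model` and `encode` are hypothetical names, and exact `Fraction` arithmetic is used for clarity rather than speed.

```python
from collections import Counter
from fractions import Fraction

END = ""  # end-of-string symbol; the empty string sorts before every character


def build_model(sample):
    """Map each symbol to (cumulative probability of smaller symbols, own probability)."""
    counts = Counter()
    for s in sample:
        counts.update(s)   # count individual characters
        counts[END] += 1
    total = sum(counts.values())
    model, cum = {}, 0
    for sym in sorted(counts):  # sorted order is what preserves lexicographic order
        model[sym] = (Fraction(cum, total), Fraction(counts[sym], total))
        cum += counts[sym]
    return model


def encode(s, model, bits=32):
    """Return the low end of the string's arithmetic-coding interval, scaled to `bits` bits."""
    low, width = Fraction(0), Fraction(1)
    for sym in list(s) + [END]:
        start, p = model[sym]  # raises KeyError for characters missing from the sample
        low += width * start
        width *= p
    return int(low * (1 << bits))


sample = ["apple", "apply", "banana", "band", "bandit", "cat"]
model = build_model(sample)
keys = [encode(w, model) for w in sorted(sample)]
```

Frequent characters get wide sub-intervals, so the keys spread more evenly over the integer range than a plain "first four bytes" encoding would. Characters that never occur in the sample need some fallback (for example a small pseudo-count for every possible character), which is left out here.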
Edit2:
To fit more of the string into the compressed key, it may be preferable to use some form of entropy coding in which the encoding of a character depends on several previous characters, but no more than one or two (improving compressibility too much will degrade performance). Alternatively, if the integer key has to be short (say 16 bits), it is better to use entropy methods to precalculate all keys and store them in a collection ordered by string; in this case the encoded prefix may be much longer.
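For the short-key case, a precalculated table could look roughly like the sketch below. Note the substitution: instead of an explicit entropy model, it takes the boundaries from equal-count quantiles of a sorted sample, which is the simplest way I know to get evenly used keys from a survey of the strings. `build_table` and `encode` are hypothetical names, and `bits=2` in the usage lines is only there to keep the example small.

```python
from bisect import bisect_right


def build_table(sample, bits=16):
    """Return sorted boundary strings that split the sample into at most 2**bits buckets."""
    ordered = sorted(set(sample))
    n, slots = len(ordered), 1 << bits
    if n <= slots:
        return ordered[1:]  # few enough strings: every distinct string gets its own key
    # Otherwise pick slots - 1 boundaries at equal-count steps through the sample.
    return [ordered[i * n // slots] for i in range(1, slots)]


def encode(s, boundaries):
    """The key is the number of boundaries that are <= s, found by binary search."""
    return bisect_right(boundaries, s)


sample = ["apple", "apply", "banana", "band", "bandit", "cat", "dog", "eel"]
table = build_table(sample, bits=2)
keys = [encode(w, table) for w in sorted(sample)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the table is ordered by string, any string (including one that was not in the sample) maps to a key that respects lexicographic order, and the boundaries can be as long as needed to separate densely populated regions.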