I’d like to store an additional column in a table as a ‘sort value’: a numeric representation of the title column whose ordering matches the string’s natural alphabetical sort order. I.e., I can retrieve rows ordered by the sort value and they’ll be in natural sort order, and when I insert a new row I can generate the numeric value and know that its position relative to the others reflects the string’s position in an alphabetical sort, accurate to the first X letters or so.
A couple of reasons for this: firstly, I would like a more natural ordering than the plain ordering offered by a DB server, one where things like ‘The’ and ‘A’ and punctuation are ignored at the start, and numbers are treated ‘naturally’.
Secondly, this is for an index with a lot of permutations – it will save space, and perhaps time when traversing an index with many rows.
What I am after is the algorithm to translate the string to that numeric value, or just, I suppose, a normalised string value.
I am using PHP and MySQL.
I’m afraid that ‘pull everything from the DB and sort in PHP using natcasesort()’ is not a solution for this particular situation, as I’d like to retrieve rows (using order by and group by) in sorted order before they get to a join or limit clause. Thanks.
Edit:
Thanks for the answers so far. It’s just occurred to me that the fact my application uses UTF-8 is quite relevant. With that said, I think representing the initial part of a string in a packed/numeric form is a stretch; maybe just some sort of normalised form (everything case-folded, numbers zero-padded, and as many characters as possible normalised to their root, i.e. ã to a) would be appropriate.
Thanks for the answers so far. I just wanted to update people with the solution I’m going with. I’ve taken an approach that is different from that which I envisaged in my question.
To recap, I wanted to store a representation of each string such that, when retrieved in binary order, whatever I stored for ‘8 Mile’ would be sorted before whatever I stored for ‘101 Dalmatians’.
For each number in the string, which is essentially a sequence of digits, I insert a digit before it that gives the number of digits it contains.
So, ‘8’ becomes ‘18’, and ‘101’ becomes ‘3101’. It adds some redundancy to the number, in that you are using more digits than you need and some values won’t exist, but they now have the property that a binary sort will sort the numbers into numerical order. ‘101’ would have sorted before ‘8’ beforehand, which was undesired. After adding that extra digit, ‘18’ sorts before ‘3101’.
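Note: if the number is 9 or more digits long, I add two digits to the start instead of one: a 9, then the number of digits minus 9, then the number itself. The leading 9 keeps these after every shorter number, so a 9-digit number gets the prefix ‘90’ and an 18-digit number gets ‘99’. This allows for numbers up to 18 digits: good enough for me.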
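For anyone wanting to try the digit-count prefix, here is a minimal sketch. It is written in Python for brevity rather than the PHP I actually use (the same logic should port with preg_replace_callback), and the function names are made up for illustration. It assumes the 9 comes first in the two-digit prefix, that leading zeros are counted like any other digit, and that anything longer than 18 digits is simply rejected.

```python
import re


def prefix_number(digits: str) -> str:
    """Prefix a run of ASCII digits so that plain string order matches numeric order."""
    n = len(digits)
    if n < 9:
        # One prefix digit: the length itself (1..8).
        return f"{n}{digits}"
    if n <= 18:
        # Two prefix digits: '9' first so these sort after all shorter numbers,
        # then the length minus 9 (0..9).
        return f"9{n - 9}{digits}"
    # Assumption: longer runs are out of scope, so fail loudly.
    raise ValueError("numbers longer than 18 digits are not supported")


def encode_numbers(text: str) -> str:
    """Rewrite every digit run in text with its length prefix."""
    return re.sub(r"[0-9]+", lambda m: prefix_number(m.group()), text)


if __name__ == "__main__":
    for title in ("8 mile", "101 dalmatians"):
        print(title, "->", encode_numbers(title))
```

With this, ‘8 mile’ is stored as ‘18 mile’ and ‘101 dalmatians’ as ‘3101 dalmatians’, so an ordinary ORDER BY on the stored column puts them in numeric order.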
I’m also normalising the string in other ways: everything is converted to lower case, Unicode characters are translated into the closest ASCII equivalent, and ‘a’, ‘an’, and ‘the’ are stripped if they are the first word.
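The other normalisation steps could look roughly like the sketch below, again in Python rather than PHP and with a hypothetical function name. One simplification to be aware of: it gets the ‘closest ASCII equivalent’ by Unicode decomposition and then drops whatever is still non-ASCII, so characters with no decomposition (ø or ß, for example) disappear instead of being mapped to a lookalike.

```python
import re
import unicodedata


def normalise_title(title: str) -> str:
    """Lower-case, fold accents to ASCII, and strip a leading article."""
    # Decompose accented characters (e.g. ã becomes a + combining tilde),
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Remove 'a', 'an' or 'the' only when it is the first word and is
    # followed by whitespace, so a title like 'Them!' is left alone.
    return re.sub(r"^\s*(?:a|an|the)\s+", "", lowered)


if __name__ == "__main__":
    for title in ("The São Paulo Story", "An Education", "Them!"):
        print(title, "->", normalise_title(title))
```

The digit-count prefixing would then be applied to the result of this step before storing the value.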
I gave up on making the string into one big numeric value; it is still a string, it’s just that it’s not designed for humans to read.