This is going to be a nice little brainbender I think. It is a real life problem, and I am stuck trying to figure out how to implement it. I don’t expect it to be a problem for years, and at that point it will be one of those “nice problems to have”.
So, I have documents in my search engine index. The documents can have a number of fields, however, each field size must be limited to only 100kb.
I would like to store the IDs, of particular sites which have access to this document. The site id count is low, so it is never going to get up into the extremely high numbers.
So example, this document here can be accessed by sites which have an ID of 7 and 10.
Document: {
docId: "1239"
text: "Some Cool Document",
access: "7 10"
}
Now, because the “access” field is limited to 100kb, that means that if you were to take consecutive IDs, only 18917 unique IDs could be stored.
Reference:
http://codepad.viper-7.com/Qn4N0K
<?php
$ids = range(1,18917);
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
// Output
99.9951171875kb
In my application, a particular site, of site ID 7, tries to search, and he will have access to that “Some Cool Document”
So now, my question would be, is there any way, that I could some how fit more IDs into that field?
I’ve thought about proper encoding, and applying something like a Huffman Tree, but seeing as each document has different IDs, it would be impossible to apply a single encoding set to every document.
Prehaps, I could use something like tokenized roman numerals?
Anyway, I’m open to ideas.
I should add, that I want to keep all IDs in the same field, for as long as possible. Searching over a second field, will have a considerable performance hit. So I will only switch to using a second access2 field, when I have milked the access field for as long as possible.
Edit:
Convert to Hex
<?php
function hexify(&$item){
$item = dechex($item);
}
$ids = range(1,21353);
array_walk( $ids, "hexify");
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
This yields a performance boost of 21353 consecutive IDs.
So that is up like 12.8%
Important Caveat
I think the fact that my fields can only store UTF encoded characters makes it next to impossible to get anything more out of it.
Where did 18917 come from? 100kb is a big number.
You have 100,000 or so bytes. Each byte can be 255 long, if you store it as a number.
If you encode as hex, you’ll get 100,000 ^ 16, which is a very large number, and that just hex encoding.
What about base64? You stuff 3 bytes into a 4 byte space (a little loss), but you get 64 characters per character. So 100,000 ^ 64. That’s a big number.
You won’t have any problems. Just do a simple hex encoding.
EDIT:
TL;DR
Let’s say you use base64. You could fit 6.4 times more data in the same spot. No compression needed.