My data structure is a linked list of blocks. A block contains 31 elements of 4 byte and one 4 byte pointer to the next block or NULL(in summary 128 bytes per block). I add elements from time to time. If the last block is full, I add another block via pointer.
One objective is to use as less memory (= blocks) as possible and having no free space between two elements in a block.
This setting is fix. All code runs on a 32-bit ARM Cortex-A8 CPU with NEON pipeline.
Question:
How to find a specific element in that data structure as quickly as possible?
Approach (right now):
I use sorted blocks and binary search to check for an element (9 bit of the 4 byte are the search criteria). If the desired element is not in the current block I jump to the next block. If the element is not in the last block and the last block is not yet full, I use the result of the binary search to insert the new element (if necessary I make space using memmove within this block). Thus all blocks are always sorted.
Do you have an idea to make that faster?
This is how I search right now: (q->getPosition() is an inline function that just extracts the 9-bit position from the element via “& bitmask”)
do
{
// binary search algorithm (bsearch)
// from http://www.google.com/codesearch/
// p?hl=en#qoCVjtE_vOw/gcc4/trunk/gcc-
// 4.4.3/libiberty/bsearch.c&q=bsearch&sa=N&cd=2&ct=rc
base = &(block->points[0]);
if (block->next == NULL)
{
pointsInBlock = pointsInLastBlock;
stop = true;
}
else
{
block = block->next;
}
for (lim = pointsInBlock; lim != 0; lim >>= 1)
{
q = base + (lim >> 1);
cmp = quantizedPosition - q->getPosition();
if (cmp > 0)
{
// quantizedPosition > q: move right
base = q + 1;
lim--;
}
else if (cmp == 0)
{
// We found the QuantPoint
*outQuantPoint = q;
return true;
}
// else move left
}
}
while (!stop);
Since the bulk of the time is spent in the within-block search, that needs to be as fast as possible. Since the number of elements is fixed, you can completely unroll that loop, as in:
Study the generated assembly language and single-step it in a debugger, to make sure the compiler’s giving you good code.
You might want to write a little program to print out the above code, as it will be hard to write by hand, or possibly generate it with macros.
ADDED: Just noticed your 9-bit search criterion. In that case, just pre-allocate an array of 512 4-byte words, and index it directly. That’s the fastest, and the least code.
ALSO ADDED: If you need to keep your block structure, there’s another way to do the unrolled binary search. It’s the Jon Bentley method:
That’s slower than the if-tree above, because of manipulating
i, but could be substantially less code.