Once again, I have a problem for which I would like to shave off nanoseconds. I have a small, constant array and I would like to search it to see if a given number is a member*.
Input: A 64-bit number n.
Output: True if n is in the array, false if n is not.
What are good techniques for making binary searches fast, given the possibility to optimize for the specific elements and their distribution.
Specifics
I have an array with around 136 members (though see below: there’s some flexibility) to search. The members are not equally distributed through the range: they cluster toward the beginning and end of the range. The input numbers can be assumed to the chosen with uniform probability. It’s probably worthwhile to take advantage of this irregularity.
Here’s a sample picture of the distribution for the 136-element array. Note that only 12 of the 136 elements are between 1% and 99% of the range; the balance are below 1% or over 99%.

(source: crg4.com)
I assume that branch misprediction will be the largest cost of any implementation. I’d be happy to be proved wrong.
Notes
*
Actually, I have two arrays. Actually actually, I have a choice of what arrays to use: efficiency suggests that the first should have perhaps 10-40 members, while the second can have no more than (exactly) 136 members. My problem gives real flexibility in selecting sizes, along with limited freedom to decide precisely which members to use. If a method performs better with certain sizes or restrictions, please mention this because I may be able to use it. All things equal, I’d prefer to have the second array as large as possible. For reasons unrelated to the binary search I may need to reduce the size of the second array to <= 135 or <= 66 (this is related to the difficulty of determining the input number, which depends on the array selected).
Here’s one of the possible arrays, if it helps in testing ideas. (This pretty well reveals my purpose…!) Don’t jump to unwarranted conclusions on the basis of the first few members, though.
0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049, 12586269025, 20365011074, 32951280099, 53316291173, 86267571272, 139583862445, 225851433717, 365435296162, 591286729879, 956722026041, 1548008755920, 2504730781961, 4052739537881, 6557470319842, 10610209857723, 17167680177565, 27777890035288, 44945570212853, 72723460248141, 117669030460994, 190392490709135, 308061521170129, 498454011879264, 806515533049393, 1304969544928657, 2111485077978050, 3416454622906707, 5527939700884757, 8944394323791464, 14472334024676221, 23416728348467685, 37889062373143906, 61305790721611591, 99194853094755497, 160500643816367088, 259695496911122585, 420196140727489673, 679891637638612258, 1100087778366101931, 1779979416004714189, 2880067194370816120, 4660046610375530309, 7540113804746346429, 9320093220751060618, 9999984858389672876, 10259680355300795461, 10358875208395550958, 10396764270768694864, 10411236604793371085, 10416764544494255842, 10418876029572233892, 10419682545105283285, 10419990606626453414, 10420108275656914408, 10420153221227127261, 10420170388907304826, 10420176946377624668, 10420179451108406629, 10420180407830432670, 10420180773265728832, 10420180912849591277, 10420180966165882450, 10420180986530893524, 10420180994309635573, 10420180997280850646, 10420180998415753816, 10420180998849248253, 10420180999014828394, 10420180999078074380, 10420180999102232197, 10420180999111459662, 10420180999114984240, 10420180999116330509, 10420180999116844738, 10420180999117041156, 10420180999117116181, 10420180999117144838, 10420180999117155784, 10420180999117159965, 10420180999117161562, 10420180999117162172, 10420180999117162405, 10420180999117162494, 10420180999117162528, 10420180999117162541, 10420180999117162546, 10420180999117162548
I will initially run the program on a Phenom II x4, but optimizations for other architectures are welcome.
As a very simple possible optimization, create a 256-entry lookup table for the most significant 8 bits of your 64 bit value. Each row of the table stores indexes in the actual array of the lower and upper bounds of values with those most significant 8 bits. You only need to search this region of the array.
If your array values were uniformly distributed, all the regions would be about the same length, and this wouldn’t provide much gain (if any), it’s not much different from an interpolation search. Since your values are so skewed, most of the 256 entries will point to very short regions (near the middle) which are fast to binary search, or even 0-sized regions. 2 or 3 entries at each end will point to much larger regions of the array, which then will take relatively longer to search through (almost as long as a binary search of the whole array). Since your inputs are uniformly distributed, the average time spent searching will be reduced, and hopefully this reduction is greater than the cost of the initial lookup. Your worst-case might well end up slower, though.
To refine this, you might have a 2-level lookup table on 4 bits at a time. The first level either says “search between these indices”, or else “look up the next 4 significant bits in this second-level table”. The former is fine for middling values, where 16 times the value-range still corresponds to a very small index-range, and so is still quick to search. The latter would be for the ends of the range where the search space is larger. Total size of tables would be smaller, which may or may not give better performance due to better caching of less data. The tables themselves could be generated at runtime, or at compile-time if you’re willing to generate C code once the array values are known. You could even code the lookup table as a giant switch-statement from hell, just to see if it speeds things up or not.
If you haven’t already, you should also benchmark an interpolation search rather than a simple binary chop once you start searching in the array.
Note that I’ve worked to reduce the number of comparisons made in the binary search, rather than specifically the number of branch mispredictions. The two are sort of proportional anyway – you can’t avoid that each time you halve the possibilities in a binary search, you’ll get a misprediction in something like 50% of cases. If you really wanted to minimize mispredictions, then a linear search guarantees only one misprediction per lookup (the one that breaks the loop). That ain’t faster in general, but you could experiment to see whether there’s a size for the remaining array to be searched, below which you should switch to a linear search, perhaps unrolled, perhaps fully unrolled. There may be some other much cleverer hybrid linear/binary search that can be tuned for the relative cost of a successful vs. unsuccessful comparison, but if so I don’t know it.