The post "Why is processing a sorted array faster than random array" says that branch prediction is the reason for the performance boost with sorted arrays.
But I just tried the example in Python, and I see no difference between sorted and random arrays (I tried both bytearray and array, and used line_profiler to profile the computation).
Am I missing something?
Here is my code:
from array import array
import random

array_size = 1024
loop_cnt = 1000

# I also tried 'array', and it's almost the same
a = bytearray()  # start empty; bytearray(array_size) would pre-fill array_size zeros
for i in xrange(array_size):
    a.append(random.randint(0, 255))

# sorted
a = sorted(a)

@profile  # injected by kernprof (line_profiler) at run time
def computation():
    sum = 0
    for i in xrange(loop_cnt):
        for j in xrange(array_size):  # was 'size', which is never defined
            if a[j] >= 128:
                sum += a[j]

computation()
print 'done'
I may be wrong, but I see a fundamental difference between the linked question and your example: Python interprets bytecode, C++ compiles to native code.
In the C++ code, that if translates directly to a cmp/jl sequence, which the CPU branch predictor can treat as a single "prediction spot" specific to that loop. In Python, that comparison is actually several function calls, so (1) there is more overhead and (2) I suppose the code that performs the comparison is a function inside the interpreter that is used for every other integer comparison. That makes it a "prediction spot" not specific to the current block, which gives the branch predictor a much harder time guessing correctly.
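To see how much work sits behind that one comparison, the bytecode can be listed with the standard dis module. This is only a sketch: it assumes CPython 3.4 or later (for dis.get_instructions), and conditional_add and opnames are hypothetical helper names, not part of the question's code. The exact instruction list differs between Python versions, but each instruction is dispatched separately by the interpreter's main loop.

```python
import dis


def conditional_add(a, j, total):
    """The body of the inner loop from the question, isolated (hypothetical helper)."""
    if a[j] >= 128:
        total += a[j]
    return total


def opnames(func):
    """Return the names of the bytecode instructions that make up func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Print the full disassembly; the single 'if' becomes several instructions
    # (loads, a subscript, a compare, a conditional jump).
    dis.dis(conditional_add)
    print(len(opnames(conditional_add)), "bytecode instructions")
```

None of these instructions maps one-to-one onto a cmp/jl pair in machine code, which is the difference described above.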
Edit: also, as outlined in this paper, there are way more indirect branches inside an interpreter, so such an optimization in your Python code would probably be buried anyway by the branch mispredictions in the interpreter itself.
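If you want to check this without a line profiler, a rough timing comparison with timeit could look like the sketch below. It assumes Python 3 (range instead of xrange); conditional_sum, compare, and the reduced LOOP_CNT are my own choices for illustration, not taken from the question. I would expect the two timings to be close, for the reasons given above, but that is something to measure on your own machine.

```python
import random
import timeit

ARRAY_SIZE = 1024
LOOP_CNT = 20  # assumption: far fewer passes than the question's 1000, to keep the run short


def conditional_sum(data, loops):
    """Sum every element >= 128, repeating the pass over data 'loops' times."""
    total = 0
    for _ in range(loops):
        for value in data:
            if value >= 128:
                total += value
    return total


def compare(seed=0):
    """Time the same computation on unsorted and sorted copies of the same data."""
    rng = random.Random(seed)
    unsorted_data = bytearray(rng.randint(0, 255) for _ in range(ARRAY_SIZE))
    sorted_data = bytearray(sorted(unsorted_data))
    t_unsorted = timeit.timeit(lambda: conditional_sum(unsorted_data, LOOP_CNT), number=5)
    t_sorted = timeit.timeit(lambda: conditional_sum(sorted_data, LOOP_CNT), number=5)
    return t_unsorted, t_sorted


if __name__ == "__main__":
    t_unsorted, t_sorted = compare()
    print("unsorted: %.4f s, sorted: %.4f s" % (t_unsorted, t_sorted))
```

Both runs use the same values, so any gap between the two numbers comes from element order rather than from the amount of work done.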