I need to move a small 2D array of values around a much larger 2D array of values, and set any values of the larger array that are greater than the corresponding values in the smaller array to the values of the smaller array. Think image compositing, sort of, but using two 2D arrays of floats. I need to do this a ton of times as fast as possible. Just wondering if there is some way to optimize using NEON Assembly, the Accelerate framework or some other method I haven’t heard of. Is anything going to be much faster than a double nested for loop to compare and replace values? For example, would it possibly be faster to store the values as a 1D array instead of a 2D array? Or faster to access the values across rows rather than down each column? Just trying to squeeze out any extra speed I can get, but not sure how.
I don’t know of any functions in the Accelerate framework that will do what you want. You can definitely use NEON to accelerate it, without going directly to assembly language, using the `vmin_f32` intrinsic to process two pairs of floats at a time, or using `vminq_f32` to process four pairs at a time.
These links might help get you started using the intrinsics, but I don’t really have any better advice for you:
How to use the multiply and accumulate intrinsics in ARM Cortex-a8?
ARM Information Center – NEON Intrinsics
ARM NEON Optimization. An Example
I found those by googling `neon intrinsics tutorial`.
Also, the developer tools package includes some ARM architecture documentation:
Xcode 4.2:
/Developer/Library/PrivateFrameworks/DTISAReferenceGuide.framework/Versions/A/Resources/ARMISA.pdf
Xcode 4.3:
/Applications/Xcode.app/Contents/Applications/Instruments.app/Contents/Frameworks/DTISAReferenceGuide.framework/Versions/A/Resources/ARMISA.pdf
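To make the `vminq_f32` suggestion more concrete, here is a minimal sketch of how the compare-and-replace could look. It assumes both arrays are stored as flat, row-major 1D buffers, so each row of the small array is contiguous and can be loaded four floats at a time. The function names `min_row` and `composite_min` and their parameters are made up for illustration; they are not part of NEON or any framework. On a target without NEON the code falls back to the plain scalar loop, and I haven't benchmarked any of this.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = min(dst[i], src[i]) for n contiguous floats. */
void min_row(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Four pairs of floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t d = vld1q_f32(dst + i);
        float32x4_t s = vld1q_f32(src + i);
        vst1q_f32(dst + i, vminq_f32(d, s));
    }
#endif
    /* Scalar tail (or the whole row when NEON is unavailable).
       Note: NaN handling may differ between this loop and vminq_f32. */
    for (; i < n; ++i) {
        if (dst[i] > src[i]) {
            dst[i] = src[i];
        }
    }
}

/* Hypothetical helper: apply a small_w x small_h tile to a row-major
   big_w x big_h array, with the tile's top-left corner at (x0, y0).
   The caller is assumed to keep the tile fully inside the big array. */
void composite_min(float *big, size_t big_w, size_t big_h,
                   const float *tile, size_t small_w, size_t small_h,
                   size_t x0, size_t y0)
{
    assert(x0 + small_w <= big_w);
    assert(y0 + small_h <= big_h);
    for (size_t y = 0; y < small_h; ++y) {
        /* Walk across each row so memory access stays sequential. */
        min_row(big + (y0 + y) * big_w + x0, tile + y * small_w, small_w);
    }
}
```

With an 8x4 big array and a 5x2 tile, a call would look like `composite_min(big, 8, 4, tile, 5, 2, 2, 1)`. Working across rows rather than down columns matters here because the vector loads need adjacent elements; walking down columns of a row-major array would defeat that.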