I am processing frames in a video and displaying them live (in real time). The algorithm is fast, but I am wondering if there are any optimizations I can make that would make it even more seamless. I don’t know which functions in my algorithm take up the most time. My guess is the sqrt() function, because apparently it does some lookups, but I am not sure.
This is my algorithm:
IplImage *videoFrame = cvCreateImage(cvSize(bufferWidth, bufferHeight), IPL_DEPTH_8U, 4);
videoFrame->imageData = (char*)bufferBaseAddress;  // point the image at the camera buffer
int channels = videoFrame->nChannels;
int widthStep = videoFrame->widthStep;
int width = videoFrame->width;
int height = videoFrame->height;
for(int i=0;i<height;i++){
    // despite its name, col points at the start of row i
    uchar *col = ((uchar *)(videoFrame->imageData + i*widthStep));
    for(int j=0;j<width;j++){
        // first three channels of pixel (i, j)
        double pRed = col[j*channels + 0];
        double pGreen = col[j*channels + 1];
        double pBlue = col[j*channels + 2];
        // Euclidean distance between this pixel and the key color "green"
        double dRed = green.val[0] - pRed;
        double dGreen = green.val[1] - pGreen;
        double dBlue = green.val[2] - pBlue;
        double sDRed = dRed * dRed;
        double sDGreen = dGreen * dGreen;
        double sDBlue = dBlue * dBlue;
        double sum = sDRed + sDGreen + sDBlue;
        double euc = sqrt(sum);
        //NSLog(@"%f %f %f", pRed, pGreen, pBlue);
        // close enough to the key color: replace the pixel with "white"
        if (euc < threshold) {
            col[j*channels + 0] = white.val[0];
            col[j*channels + 1] = white.val[1];
            col[j*channels + 2] = white.val[2];
        }
    }
}
Thanks!
UPDATE
Ok, so what this does is loop through every pixel in the image and calculate the Euclidean distance between the pixel’s color and the green key color. So, overall, this is a green screen algorithm.
I did some benchmarks: without this algorithm the video runs at 30.0fps, and with it, it falls to about 8fps. But the majority of the fps drop comes from the col[j*channels + 0] accesses: if the algorithm does nothing except access the array elements, it still drops to about 10fps.
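To see which part of the loop actually costs the most (the open question in the first paragraph), it can help to time the processing step directly instead of reading the fps counter. A minimal sketch using only standard C clock(); frame_fn, ms_per_frame, frame_budget_ms and clear_channel0 are hypothetical names, and clock() reports processor time for the process, not wall-clock time:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-frame step: pixels holds height rows of width 4-byte
   pixels, with consecutive rows widthStep bytes apart. */
typedef void (*frame_fn)(unsigned char *pixels, int width, int height, int widthStep);

/* Average processor time in milliseconds for one call of fn, over `runs` calls. */
double ms_per_frame(frame_fn fn, unsigned char *pixels,
                    int width, int height, int widthStep, int runs)
{
    clock_t start = clock();
    for (int r = 0; r < runs; r++)
        fn(pixels, width, height, widthStep);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}

/* Time available per frame at a given frame rate. */
double frame_budget_ms(double fps)
{
    return 1000.0 / fps;
}

/* Placeholder step to time: zero the first channel of every pixel. */
void clear_channel0(unsigned char *pixels, int width, int height, int widthStep)
{
    for (int i = 0; i < height; i++) {
        unsigned char *row = pixels + (size_t)i * widthStep;
        for (int j = 0; j < width; j++)
            row[j * 4] = 0;
    }
}
```

For scale: at 30fps the whole frame has about 33.3 ms, while 8fps means roughly 125 ms per frame, so timing variants of the loop body this way shows directly how much of that budget each change saves.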
UPDATE 2
Ok, this is interesting. I was removing random lines from inside the double loop to see what causes the biggest overhead, and this is what I found: creating variables on the stack causes a HUGE drop in FPS. Consider this example:
for(int i=0;i<height;i++){
    uchar *col = ((uchar *)(data + i*widthStep));
    for(int j=0;j<width;j++){
        double pRed = col[j*channels + 0];
        double pGreen = col[j*channels + 1];
        double pBlue = col[j*channels + 2];
    }
}
This drops the fps to 11-ish.
Now this on the other hand:
for(int i=0;i<height;i++){
    uchar *col = ((uchar *)(data + i*widthStep));
    for(int j=0;j<width;j++){
        col[j*channels + 0];
        col[j*channels + 1];
        col[j*channels + 2];
    }
}
doesn’t drop the FPS at all! The FPS stays at a pretty 30.0. I thought I should update this and let you guys know that this is the real bottleneck: making variables on the stack. I wonder if I might get a pure 30.0fps if I inline everything.
Nvm… maybe the expressions that aren’t assigned to a variable aren’t even evaluated.
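That guess is plausible: a statement like col[j*channels + 0]; has no observable effect, so the compiler is allowed to drop it, which would make the second loop nearly free. A minimal sketch of a benchmark loop whose reads cannot be discarded, assuming the same row layout as above; read_all_pixels and read_sink are hypothetical names. Storing the result in a volatile object is an observable side effect, so the sum (and the loads feeding it) must actually be computed:

```c
#include <stddef.h>

/* Writing to a volatile object counts as observable behavior. */
volatile unsigned long read_sink;

/* Sums the first three channels of every pixel and publishes the total. */
void read_all_pixels(const unsigned char *data, int width, int height,
                     int widthStep, int channels)
{
    unsigned long total = 0;
    for (int i = 0; i < height; i++) {
        const unsigned char *row = data + (size_t)i * widthStep;
        for (int j = 0; j < width; j++) {
            total += row[j * channels + 0];
            total += row[j * channels + 1];
            total += row[j * channels + 2];
        }
    }
    read_sink = total;  /* keeps the loop from being optimized away */
}
```

Comparing this against the version with three double locals gives a fairer picture of what the reads themselves cost.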
sqrt is a monotonically increasing function, and you appear to only be using it in a threshold test. Due to monotonicity,
sqrt(sum) < threshold is equivalent to sum < threshold * threshold (assuming threshold is positive). No more expensive square root, and threshold * threshold only needs to be computed once, before the loop (the compiler may hoist it for you, but computing it explicitly doesn’t rely on that).
As a next step, you can remove the expensive multiply
j * channels from inside the inner loop. The compiler should be smart enough to do it only once and use the result three times, but it’s still a multiply that the rest of the calculation depends on, so it hurts pipelining. Remember that a multiply is the same as repeated addition? Normally doing more operations is more expensive, but in this case you already have the repetition, thanks to the loop. So keep a running offset and add channels to it on every iteration instead of multiplying.
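A minimal sketch of that strength reduction for one row, assuming the squared-threshold test from the previous step; key_row_stepping, key, paint and thresholdSq are hypothetical names standing in for green, white and threshold * threshold. The offset k advances by channels each pixel, so no multiply sits in the inner loop:

```c
/* Paints every pixel in one row whose first three channels are within
   sqrt(thresholdSq) of key[] with paint[]. k tracks j * channels by
   repeated addition instead of multiplication. */
void key_row_stepping(unsigned char *row, int width, int channels,
                      const double key[3], const unsigned char paint[3],
                      double thresholdSq)
{
    for (int j = 0, k = 0; j < width; j++, k += channels) {
        double d0 = key[0] - row[k + 0];
        double d1 = key[1] - row[k + 1];
        double d2 = key[2] - row[k + 2];
        if (d0 * d0 + d1 * d1 + d2 * d2 < thresholdSq) {
            row[k + 0] = paint[0];
            row[k + 1] = paint[1];
            row[k + 2] = paint[2];
        }
    }
}
```

A pointer that is incremented by channels works the same way; the running offset is just the most direct translation of "multiply is repeated addition".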
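Next, you have expensive conversions between
uchar and double. These aren’t needed for a threshold test: the channel differences, their squares and their sum (at most 3 × 255² = 195075) all fit comfortably in an int.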
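A minimal sketch of the whole loop in integer arithmetic, combining the squared threshold and the stepping pointer. It assumes the key color components are whole numbers in 0..255 (the original keeps them in CvScalar doubles, so they would be rounded to int once, before the loop); key_out_int and squared_threshold_int are hypothetical names. For an integer sum s, s < threshold² holds exactly when s < ceil(threshold²), so the threshold can be converted once without changing any decision:

```c
#include <stddef.h>

/* Integer T such that (int s < T) <=> (s < threshold * threshold).
   Assumes threshold >= 0. Clamped because the sum never exceeds 195075. */
int squared_threshold_int(double threshold)
{
    double t2 = threshold * threshold;
    if (t2 > 195076.0)      /* every possible sum passes anyway */
        return 195076;
    int t = (int)t2;        /* floor, since t2 >= 0 */
    if ((double)t < t2)     /* round up when t2 has a fractional part */
        t++;
    return t;
}

/* Replaces pixels close to key[] with paint[], using only int math. */
void key_out_int(unsigned char *data, int width, int height, int widthStep,
                 int channels, const int key[3], const unsigned char paint[3],
                 int thresholdSqInt)
{
    for (int i = 0; i < height; i++) {
        unsigned char *p = data + (size_t)i * widthStep;
        for (int j = 0; j < width; j++, p += channels) {
            int d0 = key[0] - p[0];
            int d1 = key[1] - p[1];
            int d2 = key[2] - p[2];
            if (d0 * d0 + d1 * d1 + d2 * d2 < thresholdSqInt) {
                p[0] = paint[0];
                p[1] = paint[1];
                p[2] = paint[2];
            }
        }
    }
}
```

Usage would be one call to squared_threshold_int(threshold) per frame (or once, if the threshold is fixed), followed by key_out_int over the frame buffer.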