I’m playing a bit with auto parallelization in ICC (11.1; old, but can’t do anything about it) and I’m wondering why the compiler can’t parallelize the inner loop for a simple gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
for (int pivot = 0; pivot < n - 1; pivot++) {
// swap row so that the row with the largest value is
// at pivot position for numerical stability
int swapPos = findPivot(matrix, pivot, n);
std::swap(matrix[pivot], matrix[swapPos]);
std::swap(vector[pivot], vector[swapPos]);
float pivotVal = matrix[pivot][pivot];
for (int row = pivot + 1; row < n; row++) { // line 72; should be parallelized
float tmp = matrix[row][pivot] / pivotVal;
for (int col = pivot + 1; col < n; col++) { // line 74
matrix[row][col] -= matrix[pivot][col] * tmp;
}
vector[row] -= vector[pivot] * tmp;
}
}
}
We’re only writing to the arrays dependent on the private row (and col) variable and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we aren’t overwriting anything.
I’m compiling with -O3 -fno-alias -parallel -par-report3 and get lots of dependencies ala: assumed FLOW dependence between matrix line 75 and matrix line 73. or assumed ANTI dependence between matrix line 73 and matrix line 75. and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can get alone.
Basically the compiler can’t figure out that there’s no dependency due to the name
matrixand the namevectorbeing both read from and written too (even though with different regions). You might be able to get around this in the following fashion (though slightly dirty):Bottom line is just give the ones you’re writing to a temporarily different name to trick the compiler. I know that it’s a little dirty and I wouldn’t recommend this kind of programming in general. But if you’re sure that you have no data dependency, it’s perfectly fine.
In fact, I’d put some comments around it that are very clear to future people who see this code that this was a workaround, and why you did it.
Edit: I think the answer was basically touched on by @FPK and an answer was posted by @Evgeny Kluev. However, in @Evgeny Kluev’s answer he suggests making this an input parameter and that might parallelize but won’t give the correct value since the entries in
matrixwon’t be updated. I think the code I posted above will give the correct answer too.