I have a program that is running slower than I’d like it to.
I’ve done some profiling, and I’ve found the section that is taking up the vast majority of processing time
DO K = 0, K_MAX
WRITE(EIGENVALUES_IO, *) K * 0.001 * PI, (W_UP(J), J=1, ATOM_COUNT)
DCMPLXW_UP(:) = DCMPLX(W_UP(:))
DO E = 1, ENERGY_STEPS
ENERGY = MIN_ENERGY + ENERGY_STEP * REAL(E, DP)
ZV = DCMPLX(ENERGY, DELTA)
ON_SITE_SINGLE = DCMPLX(0.0_DP)
DO Q = 1, ATOM_COUNT
DO J = 1, ATOM_COUNT
ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + (MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q)) / (ZV - DCMPLXW_UP(Q))
END DO
END DO
DOS_DOWN(E) = DOS_DOWN(E) - WEIGHTS(K) * SUM(IMAG(ON_SITE_SINGLE))
END DO
END DO
The line
ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + (MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q)) / (ZV - DCMPLXW_UP(Q))
Is the one that is doing the damage.
I’m fairly novice at this, is there some way of speeding this up? AFAIK, the same principles apply with C, so any help from you guys too would be nice.
The arrays are all COMPLEX
K_MAX is 1000
ENERGY_STEPS is 1000
ATOM_COUNT is low ( < 50)
All my programs run slower than I’d like. In all (OK, not all, but many of) my scientific programs there is a deep loop nest in which the innermost statement(s) take up most of the computation time. Typically I expect 90%+ of my computations to be taken up by those statements. That innermost statement of yours is being executed 2.5×10^9 times, so you should expect it to take a significant fraction of the total time.
Bearing this in mind I suggest that you:
a) Take @Alexandre’s advice to use BLAS rather than your home-brewed matrix-vector multiplication.
b) ignore @Yuval’s advice about lifting operations out of the loop – a good Fortran compiler will do this for you if you turn optimisation up high (WARNING: this is a self-fulfilling prophesy in as much as if the compiler doesn’t it is not a good one). There are a lot of other optimisations I expect from a good Fortran these days, see (d). (I don’t expect optimisation of memory access by the compiler, I expect that from BLAS.)
c) Form a realistic expectation of how much performance you should be able to get from your program. If you get a sustained FLOPs rate in excess of 10% of the CPUs rated performance you are doing very well and should spend your time doing other things rather than optimisation.
d) Read your compiler documentation very carefully. Make sure that you understand what the optimisation flags actually do. Make sure that you are generating code for the CPUs you are using, not some older variant. Switch in fast vector operations if they are available. All that sort of thing.
e) Start parallelising. OpenMP is a good place to start and, as @Nicolas indicates, the learning curve is quite gentle at first.
Oh, and advice 0, which you seem to have followed, is to measure the code’s performance and measure the impact of any changes you make.