I am developing a Lattice Boltzmann (fluid dynamics) code in F#. I am now testing it on a server with 24 cores and 128 GB of memory. The code basically consists of one main recursive function for time evolution and, inside it, a System.Threading.Tasks.Parallel.For loop that iterates over 3D space. The 3D space is 500x500x500 and one time cycle takes forever :).
let rec timeIterate time =
    // Time-consuming for loop
    System.Threading.Tasks.Parallel.For(...)
I would expect the server to use all 24 cores, i.e. to show 100% usage. What I observe is something between 1% and 30% usage.
And my questions are:
- Is F# an appropriate tool for HPC computations on such servers?
- Is it realistic to use up to 100% of CPU for a real world problem?
- What should I do to obtain a high speedup? Everything is in one big parallel for loop, so I would expect that to be all I need to do…
- If F# is NOT an appropriate language, what language is?
Thank you for any suggestions.
EDIT: I am willing to share the code if anyone is interested to take a look at it.
EDIT2: Here is the stripped version of the code: http://dl.dropbox.com/u/4571/LBM.zip
It does not do anything reasonable and I hope I have not introduced any bugs by stripping the code 🙂
The startup file is ShearFlow.fs, and at the bottom of the file is
let rec mainLoop (fA: FArrayO) (mR: MacroResult) time =
    let a = LBM.Lbm.lbm lt pA getViscosity force g (fA, mR)
F#, as a language, can encourage code that works well in parallel, in part through reduced state mutability and higher-order functions, but this is a "can", not a "will". With HPC, however, there are many specialty programming languages and compilers, as well as ways of distributing load (e.g. shared unified memory or distributed micro-kernels). F# is merely a general-purpose programming language: it may or may not have access to these techniques (e.g. bindings may or may not exist). This applies even to non-distributed parallel computing.
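As a minimal sketch of the point about immutability and higher-order functions: when the per-cell update is a pure function, the sequential and parallel versions differ only in which library function is called. The function `relax`, its parameters, and the array size are made up for illustration and are not taken from the question's code.

```fsharp
// Hypothetical per-cell update: no shared mutable state, so it is safe to run in parallel.
let relax (omega: float) (feq: float) (f: float) = f - omega * (f - feq)

// Toy input data (assumption: one float per cell).
let cells = Array.init 1000 (fun i -> float i)

// Sequential version.
let seqResult = cells |> Array.map (relax 0.5 1.0)

// Parallel version: same function, different higher-order combinator.
let parResult = cells |> Array.Parallel.map (relax 0.5 1.0)

printfn "results match: %b" (seqResult = parResult)
```

Whether this actually runs faster depends on how much work each cell does; for very cheap updates the scheduling overhead can dominate.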
It depends on what the limiting factor is. Talking to my friend who does 5k-100k+ core HPC research and development, the exchange of data and idle times are normally the limiting factor (of course, this is a much higher n 🙂), so even small improvements in IO reduction (better efficiency or a different algorithm) can lead to significant gains. Don't forget the cost of simply moving data between CPUs/caches on the same machine! And, of course, the ever-slow disk IO… Find out where the slow parts are and fix them 🙂 E.g. run a profiler. Keep in mind that this may require an entirely different algorithm or approach.
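Before reaching for a full profiler, a rough first measurement could look like the sketch below: time the same loop serially and with Parallel.For, and see whether the parallel version gains anything. Everything here is an assumption for illustration (the grid size `n`, the flat arrays `src` and `dst`, and the stand-in kernel `updatePlane`); it is not the question's LBM code.

```fsharp
open System.Diagnostics
open System.Threading.Tasks

// Assumed toy grid, far smaller than the 500x500x500 case in the question.
let n = 64
let src = Array.init (n * n * n) float
let dst : float[] = Array.zeroCreate (n * n * n)

// Hypothetical stand-in for the per-plane work.
// The inner loops walk contiguous memory; each x writes a disjoint slice of dst.
let updatePlane (x: int) =
    for y in 0 .. n - 1 do
        let rowStart = (x * n + y) * n
        for z in 0 .. n - 1 do
            dst.[rowStart + z] <- 0.5 * src.[rowStart + z] + 1.0

// Run an action and return the elapsed time in milliseconds.
let timeMs (action: unit -> unit) =
    let sw = Stopwatch.StartNew()
    action ()
    sw.Elapsed.TotalMilliseconds

let serialMs = timeMs (fun () -> for x in 0 .. n - 1 do updatePlane x)

let parallelMs =
    timeMs (fun () -> Parallel.For(0, n, (fun x -> updatePlane x)) |> ignore)

printfn "serial: %.2f ms, parallel: %.2f ms" serialMs parallelMs
```

If the parallel time is not clearly lower, the loop is probably limited by something other than raw computation, such as memory traffic, allocation, or synchronization, which is where a profiler becomes useful.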
While I am not arguing for it, my PhD friend uses and works on Charm++, a very focused language for distributed parallel computing (not the environment in question, but I'm trying to make a point 🙂). F#, by contrast, tries to be a decent general-purpose language.