I know Haskell a little bit, and I wonder if it’s possible to write something like a matrix-matrix product in Haskell that is all of the following:
1. Pure-functional: no IO or State monads in its type signature. (I don’t care what happens in the function body; that is, I don’t care if the function body uses monads, as long as the whole function is pure.) I may want to use this matrix-matrix product in a pure function.
2. Memory-safe: no malloc or pointers. I know that it’s possible to “write C” in Haskell, but you lose memory safety. Actually writing this code in C and interfacing it with Haskell also loses memory safety.
3. As efficient as, say, Java. For concreteness, let’s assume I’m talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000×1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark.)
(I don’t want this to sound like a challenge, but note that Java can satisfy all of the above easily.)
I already know that
- The triple loop implementation is not the most efficient one. It pays no attention to the cache. It’s better to use a well-written BLAS implementation in this particular case. However, one cannot always count on a C library being available for what one is trying to do. I wonder if reasonably efficient code can be written in normal Haskell.
- Some people wrote whole research papers that demonstrate #3. However, I’m not a computer science researcher. I wonder if it’s possible to keep simple things simple in Haskell.
- The Gentle Introduction to Haskell has a matrix product implementation. It wouldn’t satisfy the above requirements though.
Addressing comments:
> I have three reasons: first, the “no malloc or pointers” requirement is as yet ill-defined (I challenge you to write any piece of Haskell code which uses no pointers);
I saw plenty of Haskell programs not using Ptr. Perhaps it refers to the fact that at the machine instruction level, pointers will be used? That’s not what I meant. I was referring to the abstraction level of the Haskell source code.
> second, the attack on CS research is out of place (and furthermore I can’t imagine anything simpler than using code somebody else has already written for you); third, there are many matrix packages on Hackage (and the prep work for asking this question should include reviewing and rejecting each).
It seems that your #2 and #3 are the same (“use existing libraries”). I’m interested in the matrix product as a simple test of what Haskell can do on its own, and whether it allows you to “keep simple things simple”. I could have easily come up with a numerical problem that doesn’t have any ready libraries, but then I’d have to explain the problem, whereas everyone already knows what a matrix product is.
> How can Java possibly satisfy 1.? Any Java method is essentially :: IORef Arg -> ... -> IORef This -> IO Ret
This goes to the root of my question, actually (+1). While Java does not claim to track purity, Haskell does. In Java, whether the function is pure or not is indicated in the comments. I can claim that the matrix product is pure, even though I do mutation in the function body. The question is whether Haskell’s approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity.
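To make the point about "mutation in the function body, purity in the type" concrete, here is a minimal sketch using the ST monad from base. The function name `dot` and the list-based interface are only illustrative assumptions, not something taken from the discussion above; the point is just that the type signature mentions neither IO nor ST.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- Pure signature: neither IO nor ST appears in the type.
dot :: [Float] -> [Float] -> Float
dot xs ys = runST $ do
  acc <- newSTRef 0                 -- local mutable accumulator
  forM_ (zip xs ys) $ \(x, y) ->
    modifySTRef' acc (+ x * y)      -- in-place update, invisible to callers
  readSTRef acc

main :: IO ()
main = print (dot [1, 2, 3] [4, 5, 6])  -- computes 1*4 + 2*5 + 3*6
```

Because `runST` has a rank-2 type, the mutable reference cannot escape, so the compiler (not a comment) guarantees that `dot` behaves as a pure function.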
So something like
I suppose? (I hadn’t read the specs properly before coding, so the layout is row-major, but since the access pattern is the same, that doesn’t make a difference as mixing layouts would, so I’ll assume that’s okay.)
I haven’t spent any time on thinking about a clever algorithm or low-level optimisation tricks (I wouldn’t achieve much in Java with those anyway). I just wrote the simple loop, because the question asks for a simple triple loop, and that’s what Java gives easily, so I’ll take that.
Nowhere near 0.5-2 floating point operations per cycle, I’m afraid, neither in Java nor in Haskell. There are too many cache misses to reach that throughput with the simple triple loop.
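For a sense of scale, the throughput target can be turned into a run time. This is only back-of-the-envelope arithmetic: the helper names are made up for illustration, and the 3 GHz clock is an assumption, since the clock speed of the test machine is not stated anywhere in this thread.

```haskell
-- Floating point operations in a naive n x n matrix product:
-- one multiply and one add per inner-loop iteration.
totalFlops :: Int -> Double
totalFlops n = 2 * fromIntegral n ^ (3 :: Int)

-- Run time in seconds at a given throughput (flops per cycle) and clock (Hz).
secondsAt :: Double -> Double -> Int -> Double
secondsAt flopsPerCycle clockHz n = totalFlops n / (flopsPerCycle * clockHz)

main :: IO ()
main = do
  let clockHz = 3.0e9  -- ASSUMPTION: 3 GHz, not stated in the post
  print (totalFlops 1000)
  print (secondsAt 0.5 clockHz 1000)
  print (secondsAt 2.0 clockHz 1000)
```

Under that assumed clock, 0.5-2 flops per cycle for 1000×1000 matrices corresponds to roughly 0.33 to 1.33 seconds.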
Doing the same in Haskell, again no thinking about being clever, a plain straightforward triple loop:
and the calling module
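The listings themselves are not reproduced here, so the following is only a sketch of what such a straightforward triple loop might look like, not the code that was actually timed. It assumes the array package (Data.Array.ST and Data.Array.Unboxed), a flat row-major layout, and a hypothetical function name `matMul`; the multiplication and a small caller are combined in one module for brevity.

```haskell
module Main (main) where

import Control.Monad (forM_)
import Data.Array.ST (newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray, (!))

-- n x n matrices stored contiguously in row-major order:
-- element (i, j) lives at index i*n + j.
matMul :: Int -> UArray Int Float -> UArray Int Float -> UArray Int Float
matMul n a b = runSTUArray $ do
  c <- newArray (0, n * n - 1) 0          -- result, initialised to zero
  forM_ [0 .. n - 1] $ \i ->
    forM_ [0 .. n - 1] $ \j ->
      forM_ [0 .. n - 1] $ \k -> do
        old <- readArray c (i * n + j)    -- bounds-checked read
        writeArray c (i * n + j) (old + a ! (i * n + k) * b ! (k * n + j))
  return c

main :: IO ()
main = do
  let n = 2
      a = listArray (0, n * n - 1) [1, 2, 3, 4]
      b = listArray (0, n * n - 1) [5, 6, 7, 8]
  print (elems (matMul n a b))
```

The mutation is confined to `runSTUArray`, so `matMul` has a pure type, and every access goes through the library's bounds checks, so no Ptr or malloc is involved.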
So we’re doing almost exactly the same things in both languages. Compile the Haskell with -O2, the Java with javac. The resulting times are quite close.
And if we compile the Java code to native, with gcj -O3 -Wall -Wextra --main=MatrixProd -fno-bounds-check -fno-store-check -o jmatProd MatrixProd.java, there’s still no big difference.
As a special bonus, the same algorithm in C (gcc -O3):
So this reveals no fundamental difference between straightforward Java and straightforward Haskell when it comes to computationally intensive tasks using floating point numbers (when dealing with integer arithmetic on medium to large numbers, the use of GMP by GHC makes Haskell outperform Java’s BigInteger by a huge margin for many tasks, but that is of course a library issue, not a language one), and both are close to C with this algorithm.
In all fairness, though, that is because the access pattern causes a cache-miss every other nanosecond, so in all three languages this computation is memory-bound.
If we improve the access pattern by multiplying a row-major matrix with a column-major matrix, all become faster: the gcc-compiled C finishes in 1.18s, Java takes 1.23s, and the ghc-compiled Haskell takes around 5.8s, which can be reduced to 3 seconds by using the llvm backend.
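As a rough illustration of that access pattern (again a sketch, not the code that produced the timings), the second matrix can be passed in column-major order so that both reads in the inner loop walk memory with stride 1. The name `matMulT` and the use of `foldl'` are illustrative choices.

```haskell
import Data.Array.Unboxed (UArray, elems, listArray, (!))
import Data.List (foldl')

-- a is row-major; bT holds B in column-major order, so column j of B
-- occupies indices j*n .. j*n + n - 1. Both inner reads are contiguous.
matMulT :: Int -> UArray Int Float -> UArray Int Float -> UArray Int Float
matMulT n a bT =
  listArray (0, n * n - 1)
    [ dotAt (i * n) (j * n) | i <- [0 .. n - 1], j <- [0 .. n - 1] ]
  where
    -- Dot product of row i of A (offset ra) with column j of B (offset rb),
    -- accumulated strictly to avoid building thunks.
    dotAt ra rb =
      foldl' (\acc k -> acc + a ! (ra + k) * bT ! (rb + k)) 0 [0 .. n - 1]

main :: IO ()
main = do
  let a  = listArray (0, 3) [1, 2, 3, 4]  -- [[1,2],[3,4]], row-major
      bT = listArray (0, 3) [5, 7, 6, 8]  -- [[5,6],[7,8]], column-major
  print (elems (matMulT 2 a bT))
```

This version still uses the checked `(!)` operator, so it corresponds to the slower, bounds-checked end of the timings.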
Here, the range check by the array library really hurts. Using unchecked array access (as one should, after checking for bugs, since the checks are already done in the code controlling the loops), GHC’s native backend finishes in 2.4s, and going via the llvm backend lets the computation finish in 1.55s, which is decent, although significantly slower than both C and Java. Using the primitives from GHC.Prim instead of the array library, the llvm backend produces code that runs in 1.16s (again, without bounds-checking on each access; but in this case it can easily be proved beforehand that only valid indices are produced during the computation, so here no memory-safety is sacrificed¹; checking each access brings the time up to 1.96s, still significantly better than the bounds checking of the array library).
Bottom line: GHC needs (much) faster branching for the bounds-checking, and there’s room for improvement in the optimiser, but in principle, “Haskell’s approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity”; we’re just not there yet. For the time being, one has to decide how much of which point one is willing to sacrifice.
¹ Yes, that’s a special case; in general, omitting the bounds-check does sacrifice memory-safety, or it is at least harder to prove that it doesn’t.
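To illustrate the unchecked-access variant discussed above, here is a sketch that validates the array bounds once up front and then uses `unsafeAt`, `unsafeRead` and `unsafeWrite` from Data.Array.Base inside the loops. It is an assumption that these are the kind of unchecked accessors meant; the name `matMulUnchecked` and the `Maybe` result are illustrative choices, and this is not the GHC.Prim version.

```haskell
import Control.Monad (forM_)
import Data.Array.Base (unsafeAt, unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)
import Data.Array.Unboxed (UArray, bounds, elems, listArray)

-- Row-major n x n product. The bounds are validated once; after that every
-- offset i*n + k with 0 <= i, k < n is known to be in range, so the
-- per-access checks can be skipped without giving up memory safety.
matMulUnchecked :: Int -> UArray Int Float -> UArray Int Float
                -> Maybe (UArray Int Float)
matMulUnchecked n a b
  | n < 0 || bounds a /= full || bounds b /= full = Nothing
  | otherwise = Just $ runSTUArray $ do
      c <- newArray full 0
      forM_ [0 .. n - 1] $ \i ->
        forM_ [0 .. n - 1] $ \j ->
          forM_ [0 .. n - 1] $ \k -> do
            old <- unsafeRead c (i * n + j)   -- no bounds check
            unsafeWrite c (i * n + j)
              (old + unsafeAt a (i * n + k) * unsafeAt b (k * n + j))
      return c
  where
    full = (0, n * n - 1)

main :: IO ()
main = do
  let a = listArray (0, 3) [1, 2, 3, 4]
      b = listArray (0, 3) [5, 6, 7, 8]
  print (fmap elems (matMulUnchecked 2 a b))
```

The safety argument lives entirely in the single guard at the top: if it were dropped or got out of sync with the loop bounds, the unsafe accessors could read or write out of range, which is exactly the trade-off the footnote describes.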