Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 903985
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T15:58:38+00:00 2026-05-15T15:58:38+00:00

I’m writing a game in Haskell, and my current pass at the UI involves

  • 0

I’m writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):

Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.

The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C ‘float’ types.

I’ve looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I’ve come up with seems to hover around 2% of the performance (that is, 50x slower) compared to C from GCC with the right flags. Granted, I started with Haskell a couple weeks ago, so my experience is limited—which is why I’m coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?

First, the most recent Haskell solution (clocks about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn’t make a difference AFAICT. Replacing ‘multAdd’ with ‘(\i v -> v * 4)’ brought execution time down to 1.9 seconds, so the bitwise stuff (and consequent challenges to automatic optimization) doesn’t seem to be too much at fault.

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits

repCount = 10000
arraySize = 20000

a = fromList $ [0.2::CFloat,  0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]

multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.imap multAdd src

main = do
    print $ Data.Vector.Storable.sum $ multList repCount $ 
        Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)

Here’s what I have in C. The code here has a few #ifdefs which prevents it from being compiled straight-up; scroll down for the test driver.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;

void setv4(v4fs *v, float x, float y, float z, float w) {
    float *a = (float*) v;
    a[0] = x;
    a[1] = y;
    a[2] = z;
    a[3] = w;
}

float sumv4(v4fs *v) {
    float *a = (float*) v;
    return a[0] + a[1] + a[2] + a[3];
}

void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
    for (int j = 0; j < N; j++) {
        d[j] = s[j] * m + a;
    }
}

void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
             Vector4 a, Vector4 m) {
    for (int j = 0; j < (N*4); j+=4) {
        d[j+0] = s[j+0] * m.x + a.x;
        d[j+1] = s[j+1] * m.y + a.y;
        d[j+2] = s[j+2] * m.z + a.z;
        d[j+3] = s[j+3] * m.w + a.w;
    }
}

int main () {
    v4fs a, m;
    v4fs *s, *d;

    setv4(&a, 0.2, 0.1, 0.6, 1.0);
    setv4(&m, 0.99, 0.7, 0.8, 0.6);

    s = calloc(N, sizeof(v4fs));
    d = s;

    double start = clock();
    for (int i = 0; i < M; i++) {

#ifdef COPY
        d = malloc(N * sizeof(v4fs));
#endif

#ifdef VECTOR
        vecmult(s, d, a, m);
#else
        Vector4 aa = *(Vector4*)(&a);
        Vector4 mm = *(Vector4*)(&m);
        scamult((float*)s, (float*)d, aa, mm);
#endif

#ifdef COPY
        free(s);
        s = d;
#endif
    }
    double end = clock();

    float sum = 0;
    for (int j = 0; j < N; j++) {
        sum += sumv4(s+j);
    }
    printf("%-50s %2.5f %f\n\n", NAME,
            (end - start) / (double) CLOCKS_PER_SEC, sum);
}

This script will compile and run the tests with a number of gcc flag combinations. The best performance was had by cmath-64-native-O3-restrict-vector-nocopy on my system, taking 0.22 seconds.

import System.Process
import GHC.IOBase

cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
            [("32", "-m32"), ("64", "-m64")],
            [("generic", ""), ("native", "-march=native -msse4")],
            [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
            [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
                ("norestrict", "-DMAYBE_RESTRICT=")],
            [("vector", "-DVECTOR"), ("scalar", "")],
            [("copy", "-DCOPY"), ("nocopy", "")]
           ]

-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x

prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

runCCmd name compileCmd = do
    res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
    if res == ExitSuccess
        then do system ("./" ++ name)
                return ()
        else    putStrLn $ name ++ " did not compile"

main = do
    mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T15:58:38+00:00Added an answer on May 15, 2026 at 3:58 pm

    Roman Leschinkskiy responds:

    Actually, the core looks mostly ok to
    me. Using unsafeIndex instead of (!)
    makes the program more than twice as
    fast (see my answer above). The
    program below is much faster, though
    (and cleaner, IMO). I suspect the
    remaining difference between this and
    the C program is due to GHC’s general
    suckiness when it comes to floating
    point. The HEAD produces the
    best results with the NCG and -msse2

    First, define a new Vec4 data type:

    {-# LANGUAGE BangPatterns #-}
    
    import Data.Vector.Storable
    import qualified Data.Vector.Storable as V
    import Foreign
    import Foreign.C.Types
    
    -- Define a 4 element vector type
    data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
                     {-# UNPACK #-} !CFloat
    

    Ensure we can store it in an array

    instance Storable Vec4 where
      sizeOf _ = sizeOf (undefined :: CFloat) * 4
      alignment _ = alignment (undefined :: CFloat)
    
      {-# INLINE peek #-}
      peek p = do
                 a <- peekElemOff q 0
                 b <- peekElemOff q 1
                 c <- peekElemOff q 2
                 d <- peekElemOff q 3
                 return (Vec4 a b c d)
        where
          q = castPtr p
      {-# INLINE poke #-}
      poke p (Vec4 a b c d) = do
                 pokeElemOff q 0 a
                 pokeElemOff q 1 b
                 pokeElemOff q 2 c
                 pokeElemOff q 3 d
        where
          q = castPtr p
    

    Values and methods on this type:

    a = Vec4 0.2 0.1 0.6 1.0
    m = Vec4 0.99 0.7 0.8 0.6
    
    add :: Vec4 -> Vec4 -> Vec4
    {-# INLINE add #-}
    add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')
    
    mult :: Vec4 -> Vec4 -> Vec4
    {-# INLINE mult #-}
    mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')
    
    vsum :: Vec4 -> CFloat
    {-# INLINE vsum #-}
    vsum (Vec4 a b c d) = a+b+c+d
    
    multList :: Int -> Vector Vec4 -> Vector Vec4
    multList !count !src
        | count <= 0    = src
        | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src
    
    main = do
        print $ Data.Vector.Storable.sum
              $ Data.Vector.Storable.map vsum
              $ multList repCount
              $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)
    
    repCount, arraySize :: Int
    repCount = 10000
    arraySize = 20000
    

    With ghc 6.12.1, -O2 -fasm:

    • 1.752

    With ghc HEAD (june 26), -O2 -fasm -msse2

    • 1.708

    This looks like the most idiomatic way to write a Vec4 array, and gets the best performance (11x faster than your original). (And this might become a benchmark for GHC’s LLVM backend)

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.