Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6929393
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T11:18:42+00:00 2026-05-27T11:18:42+00:00

I wrote a storable vector instance for the data type below ( original question

  • 0

I wrote a storable vector instance for the data type below (original question here):

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16

The code for defining those instances for Storable vector is below. While I am getting very good performance with the code below, I am very much interested in generic suggestions to improve the performance of that storable instance. By generic suggestion, I mean the following:

  • It is not specific to a GHC compiler version. You can assume GHC 6.12.3+ to exclude performance bugs if any present in earlier versions, and relevant to the code here.
  • Platform-specific suggestions are ok. You may assume x86_64 Linux platform.
  • A generic suggestion more in the form of algorithm improvement (big O) is very much valued, than a suggestion that exploits hardware-specific optimizations. But, given a basic operation like peek/poke here, there is not much scope for algorithmic improvement, as far as I can tell (and hence more valuable because it is a scarce commodity 🙂
  • Compiler flags for x86_64 are acceptable (e.g., telling compiler about removing floating point safe check etc.). I am using “-O2 –make” option to compile the code.

If there is any known good library source code that does similar thing (i.e., define storable instances for union/recursive data types), I will be very much interested in checking them.

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
                deriving (Show)

instance Storable Atoms where
  sizeOf _ = 1 + sizeOf (undefined :: Int32)
  alignment _ = 1 + alignment (undefined :: Int32)

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word8) `plusPtr` 1 -- get pointer to start of the    element. First byte is type of element
            t <- peek (castPtr p::Ptr Word8)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              1 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word8) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word8) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the     element. First byte is type of element

Update:

Based on feedback from Daniel and dflemstr, I rewrote the alignment, and also, updated the constructor to be of type Word32 instead of Word8. But, it seems that for this to be effective, the data constructor too should be updated to have unpacked values – that was an oversight on my part. I should have written data constructor to have unpacked values in the first place (see performance slides by John Tibbell – slide #49). So, rewriting the data constructor, coupled with alignment and constructor changes, made a big impact on the performance, improving it by about 33% for functions over vector (a simple sum function in my benchmark test). Relevant changes below (warning – not portable but it is not an issue for my use case):

Data constructor change:

data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16

Storable sizeof and alignment changes:

instance Storable Atoms where
  sizeOf _ = 2*sizeOf (undefined :: Int32)
  alignment _ = 4

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word32) `plusPtr` 1
            t <- peek (castPtr p::Ptr Word32)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              _ -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word32) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word32) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word32) `plusPtr` 1
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T11:18:43+00:00Added an answer on May 27, 2026 at 11:18 am

    If speed is what you’re after, then this kind of bit packing isn’t the right direction to go in.

    A processor always deals with word-sized operations, meaning that if you have e.g. a 32-bit processor, the smallest amount of memory that the processor can (physically) deal with is 32 bits or 4 bytes (and for 64-bit processors its 64 bits or 8 bytes). Further; a processor can only load memory at word-boundaries, meaning at byte addresses that are multiples of the word size.

    So if you use an alignment of 5 (in this case), it means that your data is stored like this:

    |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
     [    data    ] [    data    ] [    data    ]
     00 00 00 00 01 01 00 01 00 00 00 12 34 56 78 00
     IX Value       IX Value XX XX IX Value
    
    IX = Constructor index
    Value = The stored value
    XX = Unused byte
    

    As you can see, the data gets more and more out of sync with the word boundaries, making the processor/program have to do more work to access each element.

    If you increase your alignment to 8 (64 bits), your data will be stored like this:

    |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
     [    data    ]          [    data    ]          [    data    ]
     00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78 00 00 00
     IX Value       XX XX XX IX Value XX XX XX XX XX IX Value       XX XX XX
    

    This makes you “waste” 3 bytes per element, but your data structure will be much faster, since each datum can be loaded and interpreted with far fewer instructions and aligned memory loads.

    If you are going to use 8 bytes anyways, you might as well make your constructor index to a Int32, since you aren’t using those bytes for anything else anyways, and making all of your datum elements word-aligned further increases speed:

    |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
     [        data         ] [        data         ] [        data         ]
     00 00 00 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78
     Index       Value       Index       Value XX XX Index       Value
    

    This is the price you have to pay for a faster data structures on current processor architectures.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a data structure of the form below (V is Data.Storable.Vector): data Elems
I am trying to write a storable vector instance for CString (null-terminated C chars
I wrote a windows service using VB that read some legacy data from Visual
I wrote an application that currently runs against a local instance of MySql. I
I wrote a makefile which behaves oddly. You can find it here: http://pastebit.com/pastie/8215 Basically
I wrote a small sparse matrix class with the member: std::map<int,std::map<int,double> > sm; The
i need to know how can i resolve Unions and Type** (e.g. int**) with
I wrote an android app, which requires large amount of data to be transfered
I wrote some code to generate a boost::mpl::vector to use as a lookup table
I wrote simple application that send some data to my server. Somehow the data

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.