The Archive Base

Editorial Team
Asked: June 5, 2026


My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example.

The file is very large (gigabytes in size even when compressed, so it will not fit in memory), but it is easy to parse (each line is an entry), so we can easily write something like:

parse :: Lazy.ByteString -> [LogEntry]

Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as:

totalEntries = length
nrBots = sum . map fromEnum . map isBotEntry
averageTimeOfDay = histogram . map extractHour

All of these are of the form foldl' k z . map f.
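
To make that shape concrete, here is a minimal sketch with each statistic spelled out as foldl' k z . map f. The LogEntry record and its fields (entryHour, entryIsBot) are assumptions made for illustration, as is the choice of a Data.Map for the histogram; the real types are not shown in the question.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for the question's LogEntry type.
data LogEntry = LogEntry { entryHour :: Int, entryIsBot :: Bool }

-- Each statistic written explicitly as  foldl' k z . map f.
totalEntries :: [LogEntry] -> Int
totalEntries = foldl' (\n _ -> n + 1) 0 . map id

nrBots :: [LogEntry] -> Int
nrBots = foldl' (+) 0 . map (fromEnum . entryIsBot)

-- Number of entries seen in each hour of the day (assumed meaning of histogram).
averageTimeOfDay :: [LogEntry] -> M.Map Int Int
averageTimeOfDay =
  foldl' (\m h -> M.insertWith (+) h 1 m) M.empty . map entryHour

sampleLog :: [LogEntry]
sampleLog = [LogEntry 9 False, LogEntry 9 True, LogEntry 23 True]
```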

The problem arises when I try to use them in the most natural way:

main = do
    input <- Lazy.readFile "input.txt"
    let logEntries = parse input
        totalEntries' = totalEntries logEntries
        nrBots' = nrBots logEntries
        avgTOD = averageTimeOfDay logEntries
    print totalEntries'
    print nrBots'
    print avgTOD

This keeps the whole list in memory, which is not what I want. I want the folds to run in lockstep, so that the cons cells can be garbage collected as they are consumed. If I compute only a single statistic, that is exactly what happens.

I can write a single big function that does this, but it is non-composable code.
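
For illustration, such a hand-fused function might look roughly like the sketch below. It runs in one pass with a strict accumulator, but the three statistics are welded together and cannot be reused separately. LogEntry, its fields, Stats and allStats are all assumed names, not code from my actual program.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for the question's LogEntry type.
data LogEntry = LogEntry { entryHour :: Int, entryIsBot :: Bool }

-- Strict accumulator holding all three running results at once.
data Stats = Stats !Int !Int !(M.Map Int Int)

-- One traversal computing entry count, bot count and per-hour histogram.
allStats :: [LogEntry] -> (Int, Int, M.Map Int Int)
allStats = finish . foldl' step (Stats 0 0 M.empty)
  where
    step (Stats n bots hist) e =
      Stats (n + 1)
            (bots + fromEnum (entryIsBot e))
            (M.insertWith (+) (entryHour e) 1 hist)
    finish (Stats n bots hist) = (n, bots, hist)
```

Adding a fourth statistic means editing Stats, step and finish together, which is the composability problem.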

Alternatively, and this is what I have been doing, I can run each pass separately, but that re-reads and decompresses the file every time.


1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 12:03 am

    This is a comment on sdcvvc's comment referring to this ‘beautiful folding’ essay. It was so cool (beautiful, as he says) that I couldn’t resist adding Functor and Applicative instances and a few other bits of modernization. Simultaneous folding of, say, x, y and z is a straightforward product: (,,) <$> x <*> y <*> z.

    I made a half-gigabyte file of small random ints, and it took 10 seconds to run the admittedly trivial calculation of length, sum and maximum on my rusty laptop. Further annotations don’t seem to help, but the compiler could see that Int was all I was interested in. The obvious map read . lines as a parser led to a hopeless space and time catastrophe, so I unfolded with a crude use of ByteString.readInt; otherwise it is basically a Data.List process.

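    {-# LANGUAGE GADTs #-}
    
    import Data.List (foldl', unfoldr)
    import Control.Applicative
    import qualified Data.ByteString.Lazy.Char8 as B
    
    main :: IO ()
    main = fmap readInts (B.readFile "int.txt") >>= print . fold allThree
      where allThree = (,,) <$> length_ <*> sum_ <*> maximum_
    
    -- A left fold packaged as data: step function, initial state, final projection.
    -- The state type a is existential, so folds with different states can be combined.
    data Fold b c where  F ::  (a -> b -> a) -> a -> (a -> c) -> Fold b c
    -- Strict pair, so both running states are forced at every step.
    data Pair a b = P !a !b
    
    instance Functor (Fold b) where  fmap f (F op x g) = F op x (f . g)
    
    -- <*> runs both folds side by side over the same input, in one pass.
    instance Applicative (Fold b) where
      pure c = F const () (const c)
      (F f x c) <*> (F g y c') = F (comb f g) (P x y) (c *** c')
        where comb f g (P a a') b = P (f a b) (g a' b)
              (***) f g (P x y) = f x ( g y)
    
    fold :: Fold b c -> [b] -> c
    fold (F f x c) bs = c $ (foldl' f x bs)
    
    sum_, product_ :: Num a => Fold a a
    length_ :: Fold a Int
    -- Starting from 0 means maximum_ is only correct for non-negative input.
    maximum_ :: (Num a, Ord a) => Fold a a
    sum_     = F (+) 0 id
    product_ = F (*) 1 id
    length_  = F (const . (+1)) 0 id
    maximum_ = F max 0 id
    
    -- Read one Int at a time, dropping the single separator character after each.
    readInts :: B.ByteString -> [Int]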
    readInts  = unfoldr $ \bs -> case B.readInt bs of
      Nothing      -> Nothing
      Just (n,bs2) -> if not (B.null bs2) then Just (n,B.tail bs2) 
                                          else Just (n,B.empty)
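
The statistics in the question all have the form foldl' k z . map f, whereas the Fold type above captures only the foldl' part. One possible bridge, sketched here under assumptions, is a helper that applies a function to each input before it reaches the step function. The name premap, the LogEntry record with its fields, and the Data.Map histogram are introduced for illustration only; Fold, Pair and the instances repeat the definitions above so that the block compiles on its own.

```haskell
{-# LANGUAGE GADTs #-}
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Repeated from the answer above.
data Fold b c where F :: (a -> b -> a) -> a -> (a -> c) -> Fold b c
data Pair a b = P !a !b

instance Functor (Fold b) where
  fmap f (F op x g) = F op x (f . g)

instance Applicative (Fold b) where
  pure c = F const () (const c)
  F f x c <*> F g y c' = F (comb f g) (P x y) (c *** c')
    where comb h k (P a a') b = P (h a b) (k a' b)
          (p *** q) (P a a')  = p a (q a')

fold :: Fold b c -> [b] -> c
fold (F f x c) = c . foldl' f x

-- Hypothetical helper: run each input through f before folding, so that
-- fold (premap f (F k z id))  behaves like  foldl' k z . map f.
premap :: (a -> b) -> Fold b c -> Fold a c
premap f (F op x g) = F (\acc a -> op acc (f a)) x g

-- Hypothetical stand-in for the question's LogEntry type.
data LogEntry = LogEntry { entryHour :: Int, entryIsBot :: Bool }

totalEntries, nrBots :: Fold LogEntry Int
totalEntries = F (\n _ -> n + 1) 0 id
nrBots       = premap (fromEnum . entryIsBot) (F (+) 0 id)

-- Number of entries per hour of the day.
hourHistogram :: Fold LogEntry (M.Map Int Int)
hourHistogram =
  premap entryHour (F (\m h -> M.insertWith (+) h 1 m) M.empty id)

-- All three statistics, combined so they share a single traversal.
allStats :: Fold LogEntry (Int, Int, M.Map Int Int)
allStats = (,,) <$> totalEntries <*> nrBots <*> hourHistogram
```

With these in place, fold allStats (parse input) would walk the lazily produced list once, and each statistic remains a separate, reusable value.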
    

    Edit: Unsurprisingly, since we are dealing with an unboxed type above, and an unboxed vector derived from e.g. a 2G file can fit in memory, this is all twice as fast and somewhat better behaved if it is given the obvious relettering for Data.Vector.Unboxed: http://hpaste.org/69270. Of course, this isn’t relevant where one has types like LogEntry.

    Note, though, that the Fold type and Fold ‘multiplication’ generalize over sequential types without revision; thus, for example, the Folds associated with operations on Chars or Word8s can be simultaneously folded directly over a ByteString. One must first define a foldB by relettering fold to use the foldl' functions in the various ByteString modules. But the Folds and products of Folds are the same ones you would fold over a list or vector of Chars or Word8s.
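
As a rough sketch of that relettering, assuming the lazy Char8 module is the one wanted, foldB can be written with B.foldl' in place of the list foldl'. The two example folds (countChars, countNewlines) are hypothetical names added for illustration; Fold, Pair and the instances are repeated from the code above so the block stands alone.

```haskell
{-# LANGUAGE GADTs #-}
import qualified Data.ByteString.Lazy.Char8 as B

-- Repeated from the answer above.
data Fold b c where F :: (a -> b -> a) -> a -> (a -> c) -> Fold b c
data Pair a b = P !a !b

instance Functor (Fold b) where
  fmap f (F op x g) = F op x (f . g)

instance Applicative (Fold b) where
  pure c = F const () (const c)
  F f x c <*> F g y c' = F (comb f g) (P x y) (c *** c')
    where comb h k (P a a') b = P (h a b) (k a' b)
          (p *** q) (P a a')  = p a (q a')

-- fold, relettered to use the strict left fold over a lazy ByteString.
foldB :: Fold Char c -> B.ByteString -> c
foldB (F f x c) = c . B.foldl' f x

-- Two illustrative Char folds.
countChars :: Fold Char Int
countChars = F (\n _ -> n + 1) 0 id

countNewlines :: Fold Char Int
countNewlines = F (\n ch -> if ch == '\n' then n + 1 else n) 0 id

-- Their product is built exactly as it would be for a list of Chars.
charsAndLines :: Fold Char (Int, Int)
charsAndLines = (,) <$> countChars <*> countNewlines
```

The same charsAndLines value could equally be passed to a list-based fold over a String; only the driver function changes.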
