The Archive Base

Editorial Team
Asked: June 8, 2026

I’m trying to optimize some code, using criterion to compare, for example, the effect of adding an INLINE pragma to a function. But I’m finding that results are not consistent between re-compiles/runs.

I need to know how to get results either to be consistent across runs so that I can compare them, or how to assess whether a benchmark is reliable or not, i.e. (I guess) how to interpret the details about variance, “cost of a clock call”, etc.

Details on my particular case

This is orthogonal to my main questions above, but a couple things might be causing inconsistency in my particular case:

  1. I’m trying to benchmark IO actions using whnfIO because the method using whnf in this example didn’t work.

  2. my code uses concurrency

  3. I’ve got a lot of tabs and crap open

Example output

Both of these are from the same code, compiled in the exact same way. I did the first run directly below, made a change and did another benchmark, then reverted and ran the first code again, compiling with:

ghc --make -fforce-recomp -threaded -O2 Benchmark.hs

First run:

estimating clock resolution...                                      
mean is 16.97297 us (40001 iterations)                              
found 6222 outliers among 39999 samples (15.6%)                     
  6055 (15.1%) high severe                                          
estimating cost of a clock call...                                  
mean is 1.838749 us (49 iterations)                                 
found 8 outliers among 49 samples (16.3%)                           
  3 (6.1%) high mild                                                
  5 (10.2%) high severe                                             

benchmarking actors/insert 1000, query 1000                         
collecting 100 samples, 1 iterations each, in estimated 12.66122 s  
mean: 110.8566 ms, lb 108.4353 ms, ub 113.6627 ms, ci 0.950         
std dev: 13.41726 ms, lb 11.58487 ms, ub 16.25262 ms, ci 0.950      
found 2 outliers among 100 samples (2.0%)                           
  2 (2.0%) high mild                                                
variance introduced by outliers: 85.211%                            
variance is severely inflated by outliers                           

benchmarking actors/insert 1000, query 100000                       
collecting 100 samples, 1 iterations each, in estimated 945.5325 s  
mean: 9.319406 s, lb 9.152310 s, ub 9.412688 s, ci 0.950            
std dev: 624.8493 ms, lb 385.4364 ms, ub 956.7049 ms, ci 0.950      
found 6 outliers among 100 samples (6.0%)                           
  3 (3.0%) low severe                                               
  1 (1.0%) high severe                                              
variance introduced by outliers: 62.576%                            
variance is severely inflated by outliers

Second run, ~3x slower:

estimating clock resolution...
mean is 51.46815 us (10001 iterations)
found 203 outliers among 9999 samples (2.0%)
  117 (1.2%) high severe
estimating cost of a clock call...
mean is 4.615408 us (18 iterations)
found 4 outliers among 18 samples (22.2%)
  4 (22.2%) high severe

benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 38.39478 s
mean: 302.4651 ms, lb 295.9046 ms, ub 309.5958 ms, ci 0.950
std dev: 35.12913 ms, lb 31.35431 ms, ub 42.20590 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 84.163%
variance is severely inflated by outliers

benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 2644.987 s
mean: 27.71277 s, lb 26.95914 s, ub 28.97871 s, ci 0.950
std dev: 4.893489 s, lb 3.373838 s, ub 7.302145 s, ci 0.950
found 21 outliers among 100 samples (21.0%)
  4 (4.0%) low severe
  3 (3.0%) low mild
  3 (3.0%) high mild
  11 (11.0%) high severe
variance introduced by outliers: 92.567%
variance is severely inflated by outliers

I notice that if I scale by “estimated cost of a clock call” the two benchmarks are fairly close. Is that what I should do to get a real number for comparing?

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 4:16 pm

    Although there’s certainly not enough information here to pinpoint every issue, I have a few suggestions that may help.

    Interpreting Criterion results

    The problem with the samples identified as outliers is that criterion can’t really tell if they’re outliers because they’re junk data, or if they’re valid data that’s different for some legitimate reason. It can strongly hint that they’re junk (the “variance is severely inflated” line), but what this really means is that you need to investigate your testing environment, your tests, or your application itself to determine the source of the outliers. In this case it’s almost certainly caused by system load (based on other information you’ve provided).
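To make the "high mild" / "high severe" labels less mysterious, here is a minimal sketch of a boxplot-style classification, with fences at 1.5 and 3 interquartile ranges around the quartiles. As far as I know this is the idea behind criterion's outlier report, but the names (`Outlier`, `quantile`, `classify`), the nearest-rank quartile estimate, and the sample timings are assumptions made for illustration, not criterion's actual code.

```haskell
import Data.List (sort)

-- Outlier categories, named after the labels in criterion's report.
data Outlier = LowSevere | LowMild | Normal | HighMild | HighSevere
  deriving (Eq, Show)

-- Simplified nearest-rank quantile on a non-empty sample.
-- (An assumption: a real implementation uses a more careful estimator.)
quantile :: Double -> [Double] -> Double
quantile p xs = sorted !! idx
  where
    sorted = sort xs
    idx    = min (length xs - 1) (floor (p * fromIntegral (length xs)))

-- Classify one measurement against fences at 1.5 and 3 interquartile
-- ranges below the first quartile and above the third quartile.
classify :: [Double] -> Double -> Outlier
classify xs x
  | x < q1 - 3   * iqr = LowSevere
  | x < q1 - 1.5 * iqr = LowMild
  | x > q3 + 3   * iqr = HighSevere
  | x > q3 + 1.5 * iqr = HighMild
  | otherwise          = Normal
  where
    q1  = quantile 0.25 xs
    q3  = quantile 0.75 xs
    iqr = q3 - q1

main :: IO ()
main = do
  -- Hypothetical timings in milliseconds; the last one was disturbed.
  let samples = [100, 101, 102, 103, 104, 105, 106, 107, 160]
  mapM_ (\x -> putStrLn (show x ++ " ms: " ++ show (classify samples x))) samples
```

Note that the classification only says a sample is far from the bulk of the data. It cannot say why, which is the reason the environment still has to be investigated by hand.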

    You might be interested to read BOS’s announcement of criterion, which explains how it works in quite a bit more detail and goes through some examples of exactly how system load affects the benchmarking process.

    I’m very suspicious of the difference in the “estimated cost of a clock call”. Notice that there is a high proportion of outliers (in both runs), and those outliers have a “high severe” impact. I would interpret this to mean that the clock timings criterion picked up are junk (probably in both runs), making everything else unreliable too. As @DanielFischer suggests, closing other applications may help with this problem. The worst case might be a hardware problem. If you close all other applications and the clock timings are still unreliable, you may want to test on another system.

    If you’re running multiple tests on the same system, the clock timings and cost should be fairly consistent from run to run. If they aren’t, something is affecting the timings, resulting in unreliable data.
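One rough way to check this independently of criterion is to time the clock yourself a few times and see whether the figures agree. The sketch below uses `getMonotonicTimeNSec` from `GHC.Clock` (in base since GHC 8.4); the helper names `clockCallCostNs` and `spread`, the iteration count, and the idea of looking at a max/min ratio are my own assumptions, not how criterion estimates the cost.

```haskell
import Control.Monad (replicateM_)
import GHC.Clock (getMonotonicTimeNSec)

-- Mean cost of one clock read in nanoseconds, averaged over n calls.
clockCallCostNs :: Int -> IO Double
clockCallCostNs n = do
  start <- getMonotonicTimeNSec
  replicateM_ n getMonotonicTimeNSec
  end <- getMonotonicTimeNSec
  pure (fromIntegral (end - start) / fromIntegral n)

-- Ratio between the largest and smallest value of a non-empty list.
spread :: [Double] -> Double
spread xs = maximum xs / minimum xs

main :: IO ()
main = do
  -- Repeat the measurement; on a quiet machine the figures should be
  -- close to each other, so the ratio should stay near 1.
  costs <- mapM (const (clockCallCostNs 100000)) [1 .. 5 :: Int]
  mapM_ (\c -> putStrLn ("clock call: " ++ show c ++ " ns")) costs
  putStrLn ("max/min ratio: " ++ show (spread costs))
```

If the ratio moves around a lot between invocations, that points at the machine rather than at the code being benchmarked.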

    Aside from that, here are two random ideas that may be a factor.

    CPU load

    The threaded runtime can be sensitive to CPU load. The default RTS values work well for many applications unless your system is under heavy load. The problem is that there are a few critical sections in the garbage collector, so if the Haskell runtime is resource starved (because it’s competing for CPU or memory with other applications), all progress can be blocked waiting for those sections. I’ve seen this affect performance by a factor of 2.5, which is more or less in line with the three-fold difference you see.

    Even if you don’t have issues with the garbage collector, high CPU load from other applications will skew your results and should be eliminated if possible.

    how to diagnose

    • Use top or other system utilities to check CPU load.
    • Run with +RTS -s. At the bottom of the statistics, look for these lines:

    +RTS -s output

    gc_alloc_block_sync: 0
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0
    

    Non-zero values indicate resource contention in the garbage collector. Large values here indicate a serious problem.
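If you collect these statistics over many runs, checking them by eye gets tedious. A minimal sketch of a filter for the four counters quoted above follows; it assumes the lines have exactly the `name: number` shape shown, and the function names (`parseCounter`, `contended`) and the sample values in `main` are hypothetical.

```haskell
import Data.Char (isDigit, isSpace)

-- Counter names taken from the lines quoted above.
contentionCounters :: [String]
contentionCounters =
  ["gc_alloc_block_sync", "whitehole_spin", "gen[0].sync", "gen[1].sync"]

-- Parse a line such as "gc_alloc_block_sync: 0" into (name, count).
-- Lines that do not name one of the counters above are ignored.
parseCounter :: String -> Maybe (String, Integer)
parseCounter line =
  case break (== ':') (dropWhile isSpace line) of
    (name, ':' : rest)
      | name `elem` contentionCounters
      , digits@(_ : _) <- takeWhile isDigit (dropWhile isSpace rest)
      -> Just (name, read digits)
    _ -> Nothing

-- Keep only the counters that report a non-zero value.
contended :: String -> [(String, Integer)]
contended out = [c | Just c@(_, n) <- map parseCounter (lines out), n > 0]

main :: IO ()
main = do
  -- Hypothetical statistics text, in the format quoted above.
  let stats = unlines
        [ "gc_alloc_block_sync: 1204"
        , "whitehole_spin: 0"
        , "gen[0].sync: 0"
        , "gen[1].sync: 37"
        ]
  case contended stats of
    [] -> putStrLn "no GC contention reported"
    cs -> mapM_ (\(name, n) -> putStrLn (name ++ " = " ++ show n)) cs
```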

    how to fix

    • close other applications
    • specify that your executable should use fewer than all cores (e.g. +RTS -N6 or +RTS -N7 on an 8-core box)
    • disable parallel garbage collection (with +RTS -qg). I’ve usually had better results by leaving a free core than disabling the parallel collector, but YMMV.
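The "leave a core free" option can also be set from inside the benchmark program instead of on the command line. This is a sketch under the assumption that the executable is built with -threaded, as in the question; `capabilitiesFor` is a hypothetical helper name, while `getNumProcessors`, `setNumCapabilities`, and `getNumCapabilities` come from `GHC.Conc` in base.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors, setNumCapabilities)

-- Number of capabilities to use when one core is left free for the
-- rest of the system; never fewer than one.
capabilitiesFor :: Int -> Int
capabilitiesFor cores = max 1 (cores - 1)

main :: IO ()
main = do
  cores <- getNumProcessors
  -- Roughly what +RTS -N7 does on an 8-core box, but chosen at run time.
  setNumCapabilities (capabilitiesFor cores)
  caps <- getNumCapabilities
  putStrLn ("cores: " ++ show cores ++ ", capabilities: " ++ show caps)
  -- The benchmarks would be run from here.
```

Passing the flag on the command line is simpler for a one-off comparison. Setting it in code mainly helps when the same benchmark binary is run on machines with different core counts.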

    I/O

    If the functions you’re benchmarking are doing any sort of I/O (disk, network, etc.), you need to be very careful in how you interpret the results. Disk I/O can cause huge variances in performance. If you run the same function for 100 samples, after the first run any I/O might be cached by the controller. Or you may have to do a head seek if another file was accessed between runs. Other I/O typically isn’t any better.

    how to diagnose

    • you probably already know if your function is doing I/O.
    • tools like lsof can help track down mysterious I/O performance

    how to fix

    • mock the I/O. Create a ramdisk. Anything other than actually going to the hard drive etc.
    • If you really must benchmark real I/O operations, minimize interference from other applications. Maybe use a dedicated drive. Close other apps. Definitely collect multiple samples, and pay attention to the variance between them.
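As an illustration of the mocking suggestion, one option is to have the code under test take its data source as an argument, so the benchmark can pass an in-memory stand-in. This is only a sketch: `countWords` and `mockSource` are hypothetical names and the workload is invented, not taken from the question's code.

```haskell
-- The work being measured takes its data source as an argument, so a
-- benchmark can supply in-memory data instead of a real file read.
countWords :: Functor f => f String -> f Int
countWords source = length . words <$> source

-- Hypothetical in-memory source used in place of disk access.
mockSource :: IO String
mockSource = pure (unlines (replicate 1000 "insert key value"))

main :: IO ()
main = do
  n <- countWords mockSource
  print n
```

With criterion, the action `countWords mockSource` could then be handed to `whnfIO` or `nfIO`, so the measured time reflects the computation rather than the state of the disk cache.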