The Archive Base


Editorial Team
Asked: May 26, 2026

I’m implementing a motif finding algorithm from the domain of bioinformatics using Haskell. I won’t go into the details of the algorithm other than to say it’s branch and bound median string search. I had planned on making my implementation more interesting by implementing a concurrent approach (and later an STM approach) in order to get a multicore speedup, but after compiling with the following flags

$ ghc -prof -auto-all -O2 -fllvm -threaded -rtsopts --make main 

and printing the profile I saw something interesting (and perhaps obvious):

COST CENTRE      entries  %time %alloc  
hammingDistance  34677951  47.6   14.7  
motifs           4835446   43.8   71.1  

It’s clear that a remarkable speedup could be gained without going anywhere near multicore programming (although that’s been done and I just need to find some good test data and sort out Criterion for that).

Anyway, both of these functions are purely functional and in no way concurrent. They’re also doing quite simple stuff, so I was surprised that they took so much time. Here’s the code for them:

data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide] 

hammingDistance :: Motif -> Motif -> Int
hammingDistance [] [] = 0
hammingDistance xs [] = 0 -- optimistic
hammingDistance [] ys = 0 -- optimistic
hammingDistance (x:xs) (y:ys) = case (x == y) of
    True  -> hammingDistance xs ys
    False -> 1 + hammingDistance xs ys

motifs :: Int -> [a] -> [[a]]
motifs n nukeTides = [ take n $ drop k nukeTides | k <- [0..length nukeTides - n] ]

Note that of the two arguments to hammingDistance, I can actually assume that xs is going to be x long and that ys is going to be less than or equal to that, if that opens up room for improvements.

As you can see, hammingDistance calculates the Hamming distance between two motifs, which are lists of nucleotides. The motifs function takes a number and a list and returns all the substrings of that length, e.g.:

> motifs 3 "hello world"
["hel","ell","llo","lo ","o w"," wo","wor","orl","rld"]

Since the algorithmic processes involved are so simple I can’t think of a way to optimize this further. I do however have two guesses as to where I should be headed:

  1. hammingDistance: The data types I’m using (NukeTide and []) are slow/clumsy. This is just a guess, since I’m not familiar with their implementations, but I think defining my own datatype, although more legible, probably involves more overhead than I intend. Also, pattern matching is foreign to me; I don’t know whether it is trivial or costly.
  2. motifs: If I’m reading this correctly, about 70% of all memory allocations are done by motifs, and I’d assume that has to be garbage collected at some point. Again, the all-purpose list might be slowing me down, or the list comprehension might, since its cost is unclear to me.

Does anybody have any advice on the usual procedure here? If data types are the problem, would arrays be the right answer? (I’ve heard they come in boxes)

Thanks for the help.

Edit: It just occurred to me that it might be useful if I describe the manner in which these two functions are called:

totalDistance :: Motif -> Int
totalDistance motif = sum $ map (minimum . map (hammingDistance motif) . motifs l) dna

This function is the result of another function, and is passed around nodes in a tree. At each node in the tree an evaluation of the nucleotide (of length <= n, that is if == n then it is a leaf node) is done, using totalDistance to score the node. From then on it’s your typical branch and bound algorithm.

Edit: John asked that I print out the change I made which virtually eliminated the cost of motifs:

scoreFunction :: DNA -> Int -> (Motif -> Int)
scoreFunction dna l = totalDistance
    where
        -- The sum of the minimum hamming distance in each line of dna
        -- is given by totalDistance motif
        totalDistance motif = sum $ map (minimum . map (hammingDistance motif)) possibleMotifs
        possibleMotifs = map (motifs l) dna -- Previously this was computed in the line above

I didn’t make it clear in my original post, but scoreFunction is only called once, and the result is passed around in a tree traversal/branch and bound and used to evaluate nodes. Recomputing motifs at every step of the way, in retrospect, isn’t one of the brightest things I’ve done.

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 1:19 am

    Your definition of hammingDistance is probably much less efficient than it could be.

    hammingDistance (x:xs) (y:ys) = case (x == y) of
        True  -> hammingDistance xs ys
        False -> 1 + hammingDistance xs ys
    

    Because of Haskell’s laziness, this will be expanded (in the worst case) to:

    (1 + (1 + (1 + ...)))
    

    which is built up as a chain of suspended additions (thunks) and reduced only when the result is demanded; forcing a long chain can also consume a lot of stack. Whether this is actually a problem depends on the call site, compiler options, etc., so it’s often good practice to write your code in a form which avoids this issue altogether.

    A common solution is to create a tail-recursive form with a strict accumulator, but in this case you could use higher-order functions, like this:

    hammingDistance :: Motif -> Motif -> Int
    -- Count the positions that differ; zip stops at the shorter list.
    hammingDistance xs ys = length . filter (uncurry (/=)) $ zip xs ys
    

    Here’s the tail-recursive implementation, for comparison:

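    {-# LANGUAGE BangPatterns #-}  -- must appear at the top of the module

    hammingDistance :: Motif -> Motif -> Int
    hammingDistance = go 0
      where
        -- Stop as soon as either list runs out (the "optimistic" cases).
        go !acc [] _ = acc
        go !acc _ [] = acc
        -- The bang on acc keeps the running count evaluated at every step.
        go !acc (x:xs) (y:ys)
          | x == y    = go acc xs ys
          | otherwise = go (acc + 1) xs ys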
    

    This uses the BangPatterns extension to force the accumulator to be strictly evaluated; otherwise it would have the same problem as your current definition.
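
    As a third variant, the same strict accumulation can be obtained from foldl' in Data.List, which forces the accumulator at each step without needing an extension. This is only a sketch; the name hammingFold is made up for illustration and is not part of the original code.

```haskell
import Data.List (foldl')

data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide]

-- hammingFold is a hypothetical name used only for this sketch.
-- foldl' evaluates the running count at every step, so no chain of
-- suspended additions builds up. zip stops at the shorter list, which
-- matches the "optimistic" cases of the original definition.
hammingFold :: Motif -> Motif -> Int
hammingFold xs ys = foldl' step 0 (zip xs ys)
  where
    step acc (x, y)
      | x == y    = acc
      | otherwise = acc + 1
```

    For example, hammingFold [A,T,C,G] [A,T,G,G] differs in exactly one position, and comparing against a shorter motif only counts the overlapping prefix.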

    To directly answer some of your other questions:

    1. Pattern matching is trivial
    2. Whether you should use lists or arrays depends mostly on how the data is created and how it’s consumed. For this case, it’s possible that lists may be the best type. In particular, if your lists are all consumed as they’re created, and you don’t ever need the whole list in memory, they should be fine. If you do retain lists in memory though, they have a lot of space overhead.

    Usage patterns

    I think the way you use these functions does some extra work as well:

    minimum . map (hammingDistance motif) . motifs l
    

    Since you only need the minimum hammingDistance, you may be calculating a lot of extra values which aren’t necessary. I can think of two solutions to this:

    Option 1. Define a new function hammingDistanceThresh :: Motif -> Int -> Motif -> Int, which stops when it exceeds the threshold. The slightly odd type ordering is to facilitate using it in a fold, like this:

    let motifs' = motifs l dnaLine  -- dnaLine: one sequence from dna
    in foldl' (hammingDistanceThresh motif) (hammingDistance motif $ head motifs') (tail motifs')
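
    One possible shape for hammingDistanceThresh is sketched below. It assumes the threshold is the best distance seen so far and returns the smaller of the threshold and the true distance, giving up as soon as the running count reaches the threshold. The helper minDistance is a hypothetical name, and it seeds the fold with maxBound rather than head/tail so that an empty candidate list does not crash.

```haskell
import Data.List (foldl')

data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide]

-- Returns min thresh d, where d is the Hamming distance over the
-- overlapping prefix, and stops scanning once the count reaches thresh.
hammingDistanceThresh :: Motif -> Int -> Motif -> Int
hammingDistanceThresh motif thresh = go 0 motif
  where
    -- The guard inspects acc on every call, so acc never piles up unevaluated.
    go acc _ _ | acc >= thresh = thresh
    go acc (x:xs) (y:ys)
      | x == y    = go acc xs ys
      | otherwise = go (acc + 1) xs ys
    go acc _ _ = acc  -- one of the lists ran out

-- Hypothetical helper: smallest distance from motif to any candidate.
-- Yields maxBound when there are no candidates.
minDistance :: Motif -> [Motif] -> Int
minDistance motif = foldl' (hammingDistanceThresh motif) maxBound
```

    With this, minDistance motif (motifs l dnaLine) could stand in for minimum . map (hammingDistance motif) . motifs l on a single sequence, assuming the list of candidates is non-empty.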
    

    Option 2. If you define a lazy natural number type, you can use that instead of Ints for the result of hammingDistance. Then only as much of the hamming distance as necessary will be calculated.
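
    A minimal sketch of Option 2, assuming a hand-rolled Peano-style type; Nat, minNat, toInt, hammingLazy, and minDistanceLazy are all names invented for this example. The point is that minNat peels off one constructor at a time, so each candidate's distance is computed only as far as the smallest one requires.

```haskell
data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide]

-- Lazy unary naturals: each S is produced only when someone asks for it.
data Nat = Z | S Nat

-- Lazy minimum: compares one constructor at a time and stops at the first Z.
minNat :: Nat -> Nat -> Nat
minNat Z _ = Z
minNat _ Z = Z
minNat (S a) (S b) = S (minNat a b)

-- Convert to Int with a strictly evaluated counter.
toInt :: Nat -> Int
toInt = go 0
  where
    go acc Z     = acc
    go acc (S n) = let acc' = acc + 1 in acc' `seq` go acc' n

-- Emits one S per mismatch, lazily; stops when either list runs out.
hammingLazy :: Motif -> Motif -> Nat
hammingLazy (x:xs) (y:ys)
  | x == y    = hammingLazy xs ys
  | otherwise = S (hammingLazy xs ys)
hammingLazy _ _ = Z

-- Assumes a non-empty candidate list (foldr1 fails on []).
minDistanceLazy :: Motif -> [Motif] -> Int
minDistanceLazy motif = toInt . foldr1 minNat . map (hammingLazy motif)
```

    Whether this beats the threshold version in practice would need measuring; the unary representation allocates a constructor per mismatch.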

    One final note: using -auto-all will very frequently generate much slower code than other profiling options. I would suggest you try using just -auto first, and then adding manual SCC annotations if necessary.
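
    For reference, a manual cost centre is attached with an SCC pragma in front of the expression to be measured. The sketch below shows where such an annotation could go; the label "hammingDistance" is just a chosen name, and the pragma has no effect unless the program is compiled for profiling.

```haskell
data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide]

hammingDistance :: Motif -> Motif -> Int
hammingDistance xs ys =
  -- Costs of evaluating the expression below are attributed to this label
  -- in the profile report.
  {-# SCC "hammingDistance" #-} length (filter (uncurry (/=)) (zip xs ys))
```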
