I am new to Haskell and have run into a performance problem so severe that it must be my code and not the Haskell Platform.
I have my own Python implementation of the Levenshtein distance, and I tried to port it to Haskell. The result is the following:
    bool2int :: Bool -> Int
    bool2int True = 1
    bool2int False = 0

    levenshtein :: Eq a => [a] -> [a] -> Int -> Int -> Int
    levenshtein u v 0 0 = 0
    levenshtein u v i 0 = i
    levenshtein u v 0 j = j
    levenshtein u v i j = minimum [1 + levenshtein u v i (j - 1),
                                   1 + levenshtein u v (i - 1) j,
                                   bool2int (u !! (i - 1) /= v !! (j - 1)) + levenshtein u v (i - 1) (j - 1)]

    distance :: Eq a => [a] -> [a] -> Int
    distance u v = levenshtein u v (length u) (length v)
Now, for strings of length 10 or more, the execution times of the Python and Haskell versions differ by several orders of magnitude. Rough wall-clock measurements (I haven’t found a clock() equivalent in Haskell so far) also suggest that my Haskell implementation does not cost O(mn), but something that grows far faster.
Nota bene: I do not need my Haskell implementation to compete with the Python script on speed. I just want it to run in a “sensible” time, not in multiples of the age of the universe.
Questions:
- What am I doing wrong, that my implementation is so darn slow?
- How to fix it?
- Talking about “lazy evaluation”: I gather that if levenshtein "cat" "kit" 2 2 is called thrice, it is only calculated once. Is this right?
- There must be something built-in for my bool2int, right?
- Any other input is highly appreciated if it shoves me ahead on the rough path to mastering haskell.
EDIT: Here goes the python code for comparison:
    #! /usr/bin/python3.2
    # -*- coding: utf-8 -*-

    class Levenshtein:
        def __init__ (self, u, v):
            self.__u = ' ' + u
            self.__v = ' ' + v
            self.__D = [ [None for x in self.__u] for x in self.__v]
            for x, _ in enumerate (self.__u): self.__D [0] [x] = x
            for x, _ in enumerate (self.__v): self.__D [x] [0] = x

        @property
        def distance (self):
            return self.__getD (len (self.__v) - 1, len (self.__u) - 1)

        def __getD (self, i, j):
            if self.__D [i] [j] is not None: return self.__D [i] [j]
            self.__D [i] [j] = min ( [self.__getD (i - 1, j - 1) + (0 if self.__v [i] == self.__u [j] else 1),
                                      self.__getD (i, j - 1) + 1,
                                      self.__getD (i - 1, j) + 1] )
            return self.__D [i] [j]

    print (Levenshtein ('first string', 'second string').distance)
Your algorithm has exponential complexity. You seem to be assuming that the calls are being memoized for you, but that’s not the case.
You’ll need to add explicit memoization, possibly using an array or some other method.
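To get a feel for how fast the work grows, here is a rough sketch that counts how often the naive recursion is entered for given indices. The helper naiveCalls is hypothetical (it is not part of the question's code) and it models only the call structure, since the string contents do not change how many calls are made.

```haskell
-- Counts how many times the naive 'levenshtein u v i j' from the question
-- is entered, including the top-level call. Only the indices matter for
-- the count, so the strings are left out.
-- 'naiveCalls' is a hypothetical helper used for illustration only.
naiveCalls :: Int -> Int -> Integer
naiveCalls _ 0 = 1   -- base case: one call, no recursion
naiveCalls 0 _ = 1   -- base case: one call, no recursion
naiveCalls i j = 1 + naiveCalls i (j - 1)
                   + naiveCalls (i - 1) j
                   + naiveCalls (i - 1) (j - 1)

main :: IO ()
main = mapM_ report [2, 4, 6, 8]
  where
    report n = putStrLn (show n ++ " x " ++ show n ++ ": "
                         ++ show (naiveCalls n n) ++ " calls")
```

Every non-base call spawns three more, so the count multiplies with each extra character, whereas a memoized table only ever has (m + 1) * (n + 1) cells to fill.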
No, Haskell does not do automatic memoization. Laziness means that if you do let y = f x in y + y, then the f x will only be evaluated (once) if the result of the sum is demanded. It does not mean that f x + f x will evaluate in only one call to f x. You have to be explicit when you want to share results from subexpressions.
Yes, there is an instance Enum Bool, so you can use fromEnum.
While writing stuff from scratch may be fun and educational, it is important to learn to take advantage of the many great libraries on Hackage when doing common things like this.
For example there is an implementation of the Levenshtein distance in the edit-distance package.
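The difference between a shared let binding and two separate calls can be made visible with Debug.Trace. This is only a sketch: slowSquare is a hypothetical stand-in for an expensive function, and how many trace messages you actually see for the second variant can depend on the optimization level, since the compiler is allowed to merge common subexpressions.

```haskell
import Debug.Trace (trace)

-- 'slowSquare' is a hypothetical stand-in for an expensive function.
-- 'trace' writes its message to stderr whenever the body is evaluated.
slowSquare :: Int -> Int
slowSquare x = trace ("evaluating slowSquare " ++ show x) (x * x)

-- The let-bound value is shared: 'y' is evaluated at most once,
-- and only if the sum is demanded.
shared :: Int -> Int
shared x = let y = slowSquare x in y + y

-- Two syntactically separate calls: the language does not promise
-- that the second one reuses the result of the first.
unshared :: Int -> Int
unshared x = slowSquare x + slowSquare x

main :: IO ()
main = do
  print (shared 3)
  print (unshared 3)
```

In GHCi or an unoptimized build you would typically expect one trace message for shared and two for unshared, which is the situation the recursive levenshtein calls are in.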
I translated your Haskell code back to Python for comparison:
Even without fixing the O(n) indexing issue that chrisdb pointed out in his answer, this performs slower than the Haskell version when compiled:
Of course, they both lose to the properly memoized version in the edit-distance package:
Here’s a simple memoized implementation using Data.Array:
It performs similarly to your original Python code:
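For something concrete to run, here is a minimal sketch of memoization with Data.Array (from the array package that ships with GHC). It is an illustration only and may differ from the listing this answer refers to; the local names table, cell, ua and va are assumptions of the sketch, and no timings are implied. It also converts the input lists to arrays to avoid the O(n) list indexing, and uses fromEnum in place of bool2int.

```haskell
import Data.Array (Array, listArray, range, (!))

-- Levenshtein distance with explicit memoization in a lazily built array.
distance :: Eq a => [a] -> [a] -> Int
distance u v = table ! (m, n)
  where
    m = length u
    n = length v
    -- 1-based arrays give O(1) indexing instead of the O(n) list (!!).
    ua = listArray (1, m) u
    va = listArray (1, n) v
    bnds = ((0, 0), (m, n))
    -- Every cell is a lazy thunk that refers to neighbouring cells of the
    -- same array, so each (i, j) is computed at most once.
    table :: Array (Int, Int) Int
    table = listArray bnds [cell i j | (i, j) <- range bnds]
    cell i 0 = i
    cell 0 j = j
    cell i j = minimum
      [ 1 + table ! (i, j - 1)
      , 1 + table ! (i - 1, j)
      , fromEnum (ua ! i /= va ! j) + table ! (i - 1, j - 1)
      ]

main :: IO ()
main = do
  print (distance "kitten" "sitting")  -- prints 3
  print (distance "cat" "kit")         -- prints 2
```

The design choice is to let laziness do the bookkeeping: the array is defined in terms of itself, and demanding table ! (m, n) forces only the cells it depends on, each exactly once, which gives the O(mn) behaviour the Python version gets from its explicit cache.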