I’ve recently written some Scala code which processes a String, finding all its sub-strings and retaining a list of those which are found in a dictionary. The start and end of the sub-strings within the overall string also have to be retained for later use, so the easiest way to do this seemed to be just to use nested for loops, something like this:
for (i <- 0 until word.length)
  for (j <- i until word.length) {
    val sub = word.substring(i, j + 1)
    // lookup sub in dictionary here and add new match if found
  }
As an exercise, I decided to have a go at doing the same thing in Haskell. It seems straightforward enough without the need for the sub-string indices – I can use something like this approach to get the sub-strings, then call a recursive function to accumulate the matches. But if I want the indices too it seems trickier.
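To make the index-free version concrete, here is a minimal sketch of what I mean. The names `substrings` and `matches` are made up for illustration, and the dictionary is assumed to be a plain list of words; a real dictionary would probably be a `Set` or similar. It also uses a plain `filter` rather than an explicit recursive accumulator.

```haskell
import Data.List (inits, tails)

-- All non-empty contiguous sub-strings, without their positions.
substrings :: [a] -> [[a]]
substrings = filter (not . null) . concatMap inits . tails

-- Hypothetical lookup: keep only the sub-strings that appear in the
-- dictionary, which is assumed here to be a plain list of words.
matches :: [String] -> String -> [String]
matches dict = filter (`elem` dict) . substrings
```

With this, `matches ["ab", "c", "zz"] "abc"` keeps `"ab"` and `"c"`, but there is no record of where in the string they were found.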
How would I write a function which returns a list containing each continuous sub-string along with its start and end index within the “parent” string?
For example, tokens "blah" would give [("b",0,0), ("bl",0,1), ("bla",0,2), ...]
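For comparison, the Scala loops translate fairly directly into a list comprehension. This is only a sketch: `tokensByIndex` is a made-up name, and it slices with `take` and `drop`, so each sub-string costs a walk from the start of the list. That is fine for short words but probably not what an experienced Haskeller would write.

```haskell
-- Hypothetical helper mirroring the nested Scala loops: every contiguous
-- sub-string paired with its inclusive start and end indices.
tokensByIndex :: String -> [(String, Int, Int)]
tokensByIndex word =
  [ (take (j - i + 1) (drop i word), i, j)
  | i <- [0 .. n - 1]      -- start index
  , j <- [i .. n - 1]      -- end index (inclusive)
  ]
  where
    n = length word
```

Here `tokensByIndex "blah"` starts with `("b",0,0)`, `("bl",0,1)`, `("bla",0,2)`, in the same order the loops would visit them.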
Update
A great selection of answers and plenty of new things to explore. After messing about a bit, I’ve gone for the first answer, with Daniel’s suggestion to allow the use of [0..].
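import Data.List (inits, tails)

-- A matched sub-string with its inclusive start and end indices.
data Token = Token String Int Int deriving Show
-- All non-empty contiguous sub-sequences of a list.
continuousSubSeqs = filter (not . null) . concatMap tails . inits
-- Pair each sub-string with the matching run of indices taken from [0..].
tokenize xs = map (\(s, l) -> Token s (head l) (last l)) $ zip s ind
  where s   = continuousSubSeqs xs
        ind = continuousSubSeqs [0..]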
This seemed relatively easy to understand, given my limited Haskell knowledge.
Works like this: