Recently I needed to compare two sets of historical data. Since a day or two was sometimes missing from one of them and I wanted to be precise, I decided to create a list of all possible dates and two lists of tuples, one per data set, each pairing a date with its corresponding value. I then converted the latter lists to Maps in order to speed up date lookups.
The idea was to look up every date from the full date list in both Maps and build a list of “triples” (date, value1, value2) containing only the dates for which both data sets had a value. Then I could write them to a file and compare them properly.
DON’T MIND THE CODE, IT’S INCLUDED ONLY FOR GOOD MEASURE
Here is the code (it is not optimal at all, but for that small task it did its job nicely):
import qualified Data.Map as M
import Data.List (transpose)
import Data.Maybe (fromJust)

main = do
  dts <- readFile "dates.txt"
  cts1 <- readFile "eu.txt"
  cts2 <- readFile "usa.txt"
  let
    dates = lines dts
    cols1 = transpose $ map words $ lines cts1
    cols2 = transpose $ map words $ lines cts2
    prs1 = zip (head cols1) (last cols1)
    prs2 = zip (head cols2) (last cols2)
    map1 = M.fromList prs1
    map2 = M.fromList prs2
    trips = map fromJust (filter (/=Nothing) (map (\date -> getTrips date map1 map2) dates))
    cols3 = map (\(a,b,c) -> [a,b,c]) trips
    result = unlines $ map unwords $ cols3
  writeFile "trips.txt" result

getTrips :: String -> M.Map String String -> M.Map String String -> Maybe (String, String, String)
getTrips date map1 map2
  | is1 /= Nothing && is2 /= Nothing = Just (date, fromJust is1, fromJust is2)
  | otherwise = Nothing
  where
    is1 = M.lookup date map1
    is2 = M.lookup date map2
TL;DR: The code worked (though I would gladly hear some opinions/advice), but I have some questions:
- There were only around 2000 dates, so I didn’t care much about performance (you can see that I was using Strings everywhere); was using Data.Map overkill, then? When should Data.Map be preferred over lists of tuples?
- The Map was created from tuples of Strings – is that fine, or should the key always be numeric in order for the balancing and lookups to work properly?
- You should use data structures that fit your problem and your performance/programming-time constraints, so using a Map was probably a good idea. Maybe in your case, if your data was already ordered, you could have done
- No, if your code typechecks, your Map will work properly (with respect to the way you’ve defined the Ord instance). But as C. A. McCann suggests, a trie may be more appropriate if your keys are lists, especially if there is much overlap between key prefixes. To convince yourself, look at how the Ord instance on lists is implemented, and imagine the number of operations that have to take place to insert the keys “abcdx”, “abcdy”, and “abcdz” into a Map versus a trie structure.
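To make the trie comparison a little more concrete, here is a minimal sketch of a String-keyed trie. The names `Trie`, `emptyTrie`, `insertTrie`, `lookupTrie`, and `exampleTrie` are made up for illustration and are not taken from any library; a real project would more likely reach for an existing trie package.

```haskell
import qualified Data.Map as M

-- Illustrative trie: an optional value at this node, plus one child per
-- next character. All names here are hypothetical, not a library API.
data Trie a = Trie (Maybe a) (M.Map Char (Trie a))

emptyTrie :: Trie a
emptyTrie = Trie Nothing M.empty

-- Walk the key one character at a time, creating child nodes as needed.
insertTrie :: String -> a -> Trie a -> Trie a
insertTrie []     v (Trie _ kids)   = Trie (Just v) kids
insertTrie (c:cs) v (Trie val kids) =
  Trie val (M.insert c (insertTrie cs v child) kids)
  where
    child = M.findWithDefault emptyTrie c kids

-- Follow the key character by character; fail as soon as a child is missing.
lookupTrie :: String -> Trie a -> Maybe a
lookupTrie []     (Trie val _)  = val
lookupTrie (c:cs) (Trie _ kids) = M.lookup c kids >>= lookupTrie cs

-- The three keys from the answer, sharing the prefix "abcd".
exampleTrie :: Trie Int
exampleTrie =
  foldr (\(k, v) t -> insertTrie k v t) emptyTrie
    [("abcdx", 1), ("abcdy", 2), ("abcdz", 3)]
```

Here the shared prefix “abcd” is stored once as a single chain of nodes, and the three keys only branch at the last character. In a `Map String`, by contrast, each comparison between two of these keys has to walk through “abcd” again before reaching the character that differs.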
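As for the request for opinions on the code itself, one possible tidy-up (a sketch only, not necessarily what either poster had in mind) is to let `Maybe` do the filtering, so that `fromJust` and the comparisons against `Nothing` disappear. The helper name `joinOnDates` and the sample dates and values below are invented for illustration.

```haskell
import qualified Data.Map as M
import Data.Maybe (mapMaybe)

-- Hypothetical helper: for each date, keep a triple only when both maps
-- contain that date. String keys work because String has an Ord instance.
joinOnDates
  :: [String]
  -> M.Map String String
  -> M.Map String String
  -> [(String, String, String)]
joinOnDates dates m1 m2 = mapMaybe lookupBoth dates
  where
    -- Nothing if either lookup fails, otherwise Just (date, v1, v2).
    lookupBoth d = (,,) d <$> M.lookup d m1 <*> M.lookup d m2

-- Invented sample data: only 2012-01-03 appears in both maps.
exampleTrips :: [(String, String, String)]
exampleTrips = joinOnDates dates m1 m2
  where
    dates = ["2012-01-02", "2012-01-03", "2012-01-04"]
    m1 = M.fromList [("2012-01-02", "1.29"), ("2012-01-03", "1.30")]
    m2 = M.fromList [("2012-01-03", "0.77"), ("2012-01-04", "0.78")]
```

Evaluating `exampleTrips` in GHCi yields the single triple for 2012-01-03, since that is the only date present in both maps.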