I’m trying to recreate Hadoop’s word count map / reduce logic in a simple Scala program for learning
This is what I have so far
val words1 = "Hello World Bye World"
val words2 = "Hello Hadoop Goodbye Hadoop"
val input = List(words1,words2)
val mapped = input.flatMap(line=>line.split(" ").map(word=>word->1))
//> mapped : List[(String, Int)] = List((Hello,1), (World,1), (Bye,1),
// (World,1), (Hello,1), (Hadoop,1),
// (Goodbye,1), (Hadoop,1))
mapped.foldLeft(Map[String,Int]())((sofar,item)=>{
if(sofar.contains(item._1)){
sofar.updated(item._1, item._2 + sofar(item._1))
}else{
sofar + item
}
})
//>Map(Goodbye -> 1, Hello -> 2, Bye -> 1, Hadoop -> 2, World -> 2)
This seems to work, but I’m sure there is a more idiomatic way to handle the reduce part (foldLeft)
I was thinking about perhaps a multimap, but I have a feeling Scala has a way to do this easily
Is there? e.g. a way to add to a map, and if the key exists, instead of replacing it, adding the value to the existing value. I’m sure I’ve seen this quesion somewhere, but couldn’t find it and neither the answer.
I know groupBy is the way to do it probably in the real world, but I’m trying to implement it as close as possible to the original map/reduce logic in the link above.
You can use Scalaz’s
|+|operator becauseMapsare part of the Semigroup typeclass:The
|+|operator is the Monoidmappendfunction (a Monoid is any “thing” that can be “added” together. Many things can be added together like this: Strings, Ints, Maps, Lists, Options etc. An example:So in your case, rather then create a
List[(String,Int)], create aList[Map[String,Int]], and then sum them: