I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:
- Create an empty immutable map
- Create a scanner that reads through the file.
- While scanner.hasNext() is true:
  - Check if the map contains the word; if it doesn’t, initialize the count to zero
  - Create a new entry with key = word and value = count + 1
  - Update the map
- At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration.
Whenever I try to run this snippet, I get the following error:
recursive value wordMap needs type.
Can somebody point out why I am getting this error and what I can do to remedy it?
Thanks
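The line `val wordMap = wordMap + (token,currentCount)` is redefining an already-defined variable. Inside the loop it declares a new `wordMap` whose right-hand side refers to itself, which is why the compiler asks for a type. If you want to do this, you need to define `wordMap` with `var` and then just reassign it inside the loop, without the `val`.
Though how about a different approach instead: count the words on each line, then combine the per-line counts?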
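For the `var` fix, a minimal sketch could look like the following. The `VarWordCount` object, the `countWords` helper, and the in-memory sample string are assumptions made for illustration; the question reads from `textfile.txt` instead.

```scala
import java.util.Scanner

object VarWordCount {
  // Hypothetical helper: counts whitespace-separated tokens from any Scanner.
  def countWords(input: Scanner): Map[String, Int] = {
    // `var` lets the name be rebound; the Map itself stays immutable.
    var wordMap = Map.empty[String, Int]
    while (input.hasNext()) {
      val token = input.next()
      val currentCount = wordMap.getOrElse(token, 0) + 1
      // Reassign instead of redeclaring with `val`; `->` builds the key/value pair.
      wordMap = wordMap + (token -> currentCount)
    }
    wordMap
  }

  def main(args: Array[String]): Unit = {
    // Sample input is an assumption; the question would pass
    // new Scanner(new java.io.File("textfile.txt")) here.
    println(countWords(new Scanner("the cat saw the dog")))
  }
}
```

Each `+` builds a new immutable map, so only the reference held by `wordMap` changes between iterations.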
Note that the per-line pre-aggregation is used to try and reduce the memory required. Counting across the entire file at once might be too big.
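One way the per-line pre-aggregation could be written is sketched below. This is a reconstruction under assumptions, not necessarily the exact code of the answer: the `PerLineWordCount` object and `countWords` helper are hypothetical names, and the filter for empty tokens is an added assumption.

```scala
import scala.io.Source

object PerLineWordCount {
  // Hypothetical helper: takes the lines of a file and returns total counts.
  def countWords(lines: Iterator[String]): Map[String, Int] = {
    // Count the tokens of each line separately (the pre-aggregation step).
    val countsByLine: Iterator[(String, Int)] = lines.flatMap { line =>
      line.split("\\s+")
        .filter(_.nonEmpty) // assumption: ignore empty tokens from blank lines
        .groupBy(identity)
        .map { case (word, tokens) => word -> tokens.length }
    }
    countsByLine
      .toStream // deprecated since Scala 2.13, where to(LazyList) replaces it
      .groupBy(_._1)
      .map { case (word, counts) => word -> counts.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("textfile.txt")
    try println(countWords(source.getLines()))
    finally source.close()
  }
}
```

Taking an `Iterator[String]` rather than a file name keeps the counting logic independent of where the lines come from.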
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala’s parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
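As a small illustration of the split step, assuming a made-up sample line (the `SplitStep` object and its contents are hypothetical):

```scala
object SplitStep {
  // Hypothetical sample line, not taken from the original answer.
  val line = "the cat saw the dog"
  // "\\s+" matches one or more whitespace characters.
  val tokens: Array[String] = line.split("\\s+")
}
```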
Now `identity` is a function that just returns its argument, so if we `groupBy(identity)`, we map each distinct word type to each word token:
And finally, we want to count up the number of tokens for each type:
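The grouping and counting steps can be sketched on a hypothetical token array as follows. The `GroupStep` object and the sample tokens are assumptions, and a plain `map` over the grouped entries is used here so the sketch does not depend on how `mapValues` behaves in a given Scala version.

```scala
object GroupStep {
  // Hypothetical tokens, as produced by splitting a line on whitespace.
  val tokens = Array("the", "cat", "saw", "the", "dog")

  // Each distinct word type maps to all of its tokens.
  val byType: Map[String, Array[String]] = tokens.groupBy(identity)

  // Replace each group of tokens with its size to get the count per type.
  val counts: Map[String, Int] =
    byType.map { case (word, occurrences) => word -> occurrences.length }
}
```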
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does `flatMap` do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
Then we get:
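As a hypothetical illustration (the two-line input below is an assumption, not the sample from the original answer), `flatMap` turns per-line count maps into one flat sequence of `(word, count)` pairs, where a word that appears on several lines shows up once per line:

```scala
object FlatMapStep {
  // Token-counting function for a single line.
  def countLine(line: String): Map[String, Int] =
    line.split("\\s+").groupBy(identity).map { case (w, ts) => w -> ts.length }

  // Hypothetical file contents: two lines.
  val lines = Iterator("a b a", "b c")

  // One (word, count) pair per distinct word per line; "b" appears twice,
  // once for each line. The order of pairs within a line is unspecified.
  val countsByLine: List[(String, Int)] =
    lines.flatMap(line => countLine(line)).toList
}
```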
So now we need to combine the counts of each line into one big set of counts. The `countsByLine` variable is an `Iterator`, so it doesn’t have a `groupBy` method. Instead we can convert it to a `Stream`, which is basically a lazy list. We want laziness because we don’t want to have to read the entire file into memory before we start. Then the `groupBy` groups all counts of the same word type together.
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
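The combining step can be sketched on hypothetical per-line counts like this. The `CombineStep` object and the sample pairs are assumptions for illustration.

```scala
object CombineStep {
  // Hypothetical per-line counts: "b" was seen once on each of two lines.
  val countsByLine: Iterator[(String, Int)] =
    Iterator("a" -> 2, "b" -> 1, "b" -> 1, "c" -> 1)

  // Iterator has no groupBy, so convert first, then group pairs by word.
  // toStream is deprecated since Scala 2.13 in favor of to(LazyList).
  val grouped = countsByLine.toStream.groupBy(_._1)

  // Take the second item of each pair (the count) and sum per word.
  val totals: Map[String, Int] =
    grouped.map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```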
And there you have it.