I’m still not proficient with scala, but I’m using it to process some data,

Question

Asked: June 14, 2026 (2026-06-14T23:02:06+00:00)

I’m still not proficient with Scala, but I’m using it to process some data, which I read from a file into the following data structure:

Map[Id, (Set[Category], Set[Tag])]

where

type Id = String

type Category = String

type Tag = String

Essentially, each key in the Map is the unique id of an entity that is associated with a set of categories and a set of tags.

My question is: what is the best (that is, most efficient and most idiomatic) way to compute the following?

tag frequencies across all entities (type TagsFrequencies = Map[Tag, Double])
tag frequencies per category (Map[Category, TagsFrequencies])

Here is my attempt:

def tagsFrequencies(tags: List[Tag]): TagsFrequencies =
  tags.groupBy(t => t).map(
    kv => (kv._1 -> kv._2.size.toDouble / tags.size.toDouble))

def computeTagsFrequencies(data: Map[Id, (Set[Category], Set[Tag])]): TagsFrequencies = {
  val tags = data.foldLeft(List[Tag]())(
    (acc, kv) => acc ++ kv._2._2.toList)
  tagsFrequencies(tags)
}

def computeTagsFrequenciesPerCategory(data: Map[Id, (Set[Category], Set[Tag])]): Map[Category, TagsFrequencies] = {

  def groupTagsPerCategory(data: Map[Id, (Set[Category], Set[Tag])]): Map[Category, List[Tag]] =
    data.foldLeft(Map[Category, List[Tag]]())(
      (acc, kv) => kv._2._1.foldLeft(acc)(
        (a, category) => a.updated(category, kv._2._2.toList ++ a.getOrElse(category, Nil))))

  val tagsPerCategory = groupTagsPerCategory(data)
  tagsPerCategory.map(tpc => (tpc._1 -> tagsFrequencies(tpc._2)))
}

As an example, consider

val data = Map(
  "id1" -> (Set("c1", "c2"), Set("t1", "t2", "t3")),
  "id2" -> (Set("c1"), Set("t1", "t4")))

then:

the tag frequencies across all entities are:

Map(t3 -> 0.2, t4 -> 0.2, t1 -> 0.4, t2 -> 0.2)

and the tag frequencies per category are:

Map(c1 -> Map(t3 -> 0.2, t4 -> 0.2, t1 -> 0.4, t2 -> 0.2), c2 -> Map(t3 -> 0.3333333333333333, t1 -> 0.3333333333333333, t2 -> 0.3333333333333333))

1 Answer

Editorial Team · Answer 1 · 2026-06-14T23:02:07+00:00

Here’s a rewrite that aims for idiom, not necessarily efficiency. I’d make your first method a little more general (accepting any Iterable rather than only a List), use identity instead of t => t, and use mapValues:

def tagsFrequencies(tags: Iterable[Tag]): TagsFrequencies =
  tags.groupBy(identity).mapValues(_.size / tags.size.toDouble)
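
One caveat, assuming Scala 2.13 or later (the question does not say which version is in use): there, mapValues is deprecated and returns a lazy MapView rather than a Map, so the definition above would not satisfy the declared TagsFrequencies return type. A minimal sketch of a strict alternative, which also computes the total once rather than once per distinct tag; the wrapper object name TagFrequencySketch is only for illustration:

```scala
object TagFrequencySketch {
  type Tag = String
  type TagsFrequencies = Map[Tag, Double]

  // Strict variant: `map` builds a real Map instead of a lazy view,
  // and the total number of tags is computed a single time.
  def tagsFrequencies(tags: Iterable[Tag]): TagsFrequencies = {
    val total = tags.size.toDouble
    tags.groupBy(identity).map {
      case (tag, occurrences) => tag -> occurrences.size / total
    }
  }
}
```

On Scala 2.12 and earlier the mapValues version compiles as written, though its result is a view that re-applies the function on each access. The same substitution applies wherever mapValues is used on a grouped map.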

Because this now takes any Iterable[Tag], you can use it to clean up the second method:

def computeTagsFrequencies(data: Map[Id, (Set[Category], Set[Tag])]) =
  tagsFrequencies(data.flatMap(_._2._2))

And similarly for the last method:

def computeTagsFrequenciesPerCategory(data: Map[Id, (Set[Category], Set[Tag])]) =
  data.values.flatMap {
    case (cs, ts) => cs.map(_ -> ts)
  }.groupBy(_._1).mapValues(v => tagsFrequencies(v.flatMap(_._2)))

None of these changes should affect performance in any meaningful way, but you should of course benchmark in your own application.
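
If a benchmark does show that the intermediate collections matter, one possible direction is to count tags directly while folding over the data, and only convert counts to frequencies at the end. The following is a sketch under that assumption, not a measured improvement; the object name SinglePassTagFrequencies and the helpers normalize and addTags are hypothetical names introduced here:

```scala
object SinglePassTagFrequencies {
  type Id = String
  type Category = String
  type Tag = String
  type TagsFrequencies = Map[Tag, Double]

  // Turn raw counts into relative frequencies.
  private def normalize(counts: Map[Tag, Int]): TagsFrequencies = {
    val total = counts.values.sum.toDouble
    counts.map { case (tag, n) => tag -> n / total }
  }

  // Add one occurrence of each tag in `tags` to a count map.
  private def addTags(counts: Map[Tag, Int], tags: Set[Tag]): Map[Tag, Int] =
    tags.foldLeft(counts)((acc, tag) => acc.updated(tag, acc.getOrElse(tag, 0) + 1))

  def computeTagsFrequencies(data: Map[Id, (Set[Category], Set[Tag])]): TagsFrequencies =
    normalize(data.values.foldLeft(Map.empty[Tag, Int]) {
      case (counts, (_, tags)) => addTags(counts, tags)
    })

  def computeTagsFrequenciesPerCategory(
      data: Map[Id, (Set[Category], Set[Tag])]): Map[Category, TagsFrequencies] = {
    // One count map per category, updated as each entity is visited.
    val countsPerCategory = data.values.foldLeft(Map.empty[Category, Map[Tag, Int]]) {
      case (acc, (categories, tags)) =>
        categories.foldLeft(acc) { (a, category) =>
          a.updated(category, addTags(a.getOrElse(category, Map.empty[Tag, Int]), tags))
        }
    }
    countsPerCategory.map { case (category, counts) => category -> normalize(counts) }
  }
}
```

Whether this is actually faster depends on the data, since every update of an immutable Map has its own cost, so it is worth measuring against the groupBy version on a realistic input.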
