This was an interview question that someone asked me and I didn’t really have a good answer. I was wondering if someone could possibly help me understand the solution to this:
“You have a stream of billion tweets coming in. How will you figure out the top 10 hashtags ? “
Thanks
Create a map, with a hashtag as the key and a counter as a value.
Increment the counter of each tag in each tweet you receive.
Examine the value of the counters to find the top 10.
Your phrasing of the question doesn’t include any constraints that would prohibit this straightforward solution. In an interview situation, I would have asked clarifying questions to elicit these constraints.
Under constraints like, “it has to run in linear time,” and, “it has to use a constant amount of memory,” much more interesting answers emerge.
I am not sure if there is a constant memory solution to the problem as posed, but I know one for a related (and often more useful) problem: identifying elements that constitute a given fraction of results. I gave it as an answer to a similar question.
(I say, “more useful”, because if the total fraction of a given item falls below a threshold, it’s more likely to be noise than true “Top 10” material.)