If we have a huge string data in a file, we can normally use

Question


Asked: June 9, 2026


If we have a huge amount of string data in a file, we can normally use an algorithm such as (hash + heap) or (trie + heap) to efficiently find the top ‘k’ most frequent words. How do I do this if the huge amount of string data is in my database? Right now the only way I know is to query the entire data set and then run the frequency operations on it, but querying such a huge data set is very costly. Is there a more efficient way to do this?


1 Answer

Editorial Team · Answer 1 · 2026-06-09T20:00:38+00:00

Finding information in huge data sets is usually done by parallelizing the work across a cluster rather than running it on a single machine.

What you are describing is a classic map-reduce problem, which can be handled using the following functions (in pseudocode):

map(doc):
  for each word in doc:
      emitIntermediate(word, "1")
reduce(word, list<count>):
  emit(word, size(list))
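
As a rough illustration, the pseudocode above can be sketched in plain Python on a single machine. The names map_doc, shuffle, reduce_word and word_count are made up for this example and do not belong to any particular map-reduce framework; the shuffle function only imitates the grouping by key that a real framework would do between the two steps.

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a real framework does this for you)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_word(word, counts):
    """Reduce step: collapse all values emitted for one word into a total."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle and reduce in sequence over a collection of documents."""
    pairs = (pair for doc in docs for pair in map_doc(doc))
    return dict(reduce_word(w, c) for w, c in shuffle(pairs).items())
```

On a cluster, each call to map_doc and reduce_word would run on a different worker; here everything runs in one process purely to show the data flow.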

The map-reduce framework, which is implemented in many languages, allows you to scale the problem easily and use a huge cluster without much effort; it takes care of failures and worker management for you.

Here, doc is a single document; the framework usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
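
The splitting idea might look like the sketch below, which also adds the heap step from the question to pick the top k words once the counts are merged. The function names and the chunk_words parameter are assumptions made for this example. It splits an in-memory string for simplicity, whereas with a real database or file you would more likely stream rows or lines in batches.

```python
import heapq
from collections import Counter


def split_into_chunks(text, chunk_words):
    """Split one large text into smaller 'documents' of chunk_words words each."""
    words = text.split()
    for start in range(0, len(words), chunk_words):
        yield words[start:start + chunk_words]


def count_chunk(words):
    """Count word frequencies within a single chunk (the per-document work)."""
    return Counter(words)


def top_k(text, k, chunk_words=1000):
    """Merge the per-chunk counts, then use a heap to select the k most frequent words."""
    total = Counter()
    for chunk in split_into_chunks(text, chunk_words):
        total.update(count_chunk(chunk))
    return heapq.nlargest(k, total.items(), key=lambda item: item[1])
```

Splitting on word boundaries, as done here, avoids cutting a word in half between two chunks, which would otherwise skew the counts.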
