Clarification on using Hadoop for a large number of files (around 2 million)

Question


Asked: June 2, 2026 (2026-06-02T17:33:36+00:00)


I need some clarification about using Hadoop with a large number of files, around 2 million. I have a data file of 2 million lines. I want to split each line into its own file, copy the files into the Hadoop file system, and calculate term frequency using Mahout, which runs its computation as distributed map-reduce jobs. Each line should be treated as one document for the term-frequency calculation, so I would end up with one directory holding 2 million documents, each consisting of a single line. Will this create n map tasks for n files, that is, 2 million map tasks for this job? That would take a lot of time to compute. Is there an alternative way of representing the documents that allows faster computation?


1 Answer

Editorial Team · Answer 1 · 2026-06-02T17:33:37+00:00


2 million files is a lot for Hadoop. More than that, running 2 million tasks will carry roughly 2 million seconds of overhead (about one second per task), which means a few days of work for a small cluster.
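As a rough check on that estimate, here is a back-of-the-envelope sketch. The one second of overhead per task follows from the figures above; the 8 parallel task slots are purely an assumption standing in for "a small cluster", and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def scheduling_overhead_days(num_tasks, overhead_s_per_task=1.0, parallel_slots=1):
    """Estimate wall-clock days spent on per-task overhead alone.

    overhead_s_per_task and parallel_slots are assumed values, not measurements.
    """
    total_seconds = num_tasks * overhead_s_per_task
    return total_seconds / parallel_slots / SECONDS_PER_DAY

# One map task per file, 2 million files.
serial_days = scheduling_overhead_days(2_000_000)
# Hypothetical small cluster running 8 tasks at a time.
small_cluster_days = scheduling_overhead_days(2_000_000, parallel_slots=8)

print(f"serial: {serial_days:.1f} days, 8 slots: {small_cluster_days:.1f} days")
```

Under these assumptions the overhead alone is about 23 days of serial task time, or close to 3 days on 8 slots, before any actual term counting happens.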
I think the problem is algorithmic in nature: how to map your computation onto the map-reduce paradigm so that you end up with a modest number of mappers. Please drop a few lines about the task you need, and I might be able to suggest an algorithm.
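One possible way to keep the mapper count modest, sketched here without knowing the full task, is to leave the 2 million lines in a single large file (or pack them into one container file such as a Hadoop SequenceFile) and treat each line as a document inside the mapper. Hadoop then splits the input by block rather than by file, so only a handful of map tasks are started. The plain-Python sketch below only illustrates the per-line logic; `term_frequencies`, `map_lines`, the tokenizer, and the use of the line position as a document id are all assumptions for illustration, not Mahout's API.

```python
import re
from collections import Counter

# Simple lowercase alphanumeric tokenizer (an assumption; Mahout uses its own analyzers).
TOKEN = re.compile(r"[a-z0-9]+")

def term_frequencies(line):
    """Count terms in one line, treating the whole line as one document."""
    return Counter(TOKEN.findall(line.lower()))

def map_lines(lines):
    """Yield (doc_id, term, count) for every line of one input split.

    doc_id is the line's position in the given input; in a real job the
    framework would supply a record key instead (assumed here).
    """
    for doc_id, line in enumerate(lines):
        for term, count in sorted(term_frequencies(line).items()):
            yield doc_id, term, count

# Two lines stand in for the 2 million documents.
sample = ["hadoop likes big files", "small files hurt hadoop hadoop"]
records = list(map_lines(sample))

for doc_id, term, count in records:
    print(f"{doc_id}\t{term}\t{count}")
```

With this layout the number of map tasks depends on the input size and block size, not on the number of documents, which is the property the answer above is asking for.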
