I’m doing some testing with nutch and hadoop and I need a massive amount


Asked: May 27, 2026


I’m doing some testing with Nutch and Hadoop and I need a massive amount of data.
I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB.

The problem is that I don’t have this amount of data, so I’m thinking of ways to produce it.

The data itself can be of any kind.
One idea is to take an initial set of data and duplicate it. But that’s not good enough, because I need files that differ from one another (identical files are ignored).

Another idea is to write a program that will create files with dummy data.

Any other ideas?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:39:58+00:00

This may be a better question for the statistics StackExchange site (see, for instance, my question on best practices for generating synthetic data).

However, if you care less about the statistical properties of the data than about the infrastructure for manipulating it, you can ignore the statistics site. If you merely want “big data”, we can focus on how to generate a large pile of it.

I can offer several answers:

If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see this Wikipedia entry for more info). I prefer a known, seeded random number generator, because anyone else can then reproduce the results exactly.
For structured data, you can look at mapping random numbers to indices and create a table that maps indices to, say, strings, numbers, etc., such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. same names), though perhaps you’d like to have a few collisions, as these occur in reality, too.
Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.
For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same can go for sentences, paragraphs, documents, etc. – just decide what properties you’d like to emulate, create a “blank” object, and fill it with text.
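
The ideas above can be sketched in a few lines of Python. This is only a minimal illustration, not a tuned generator: it assumes CPython's random module (which uses the Mersenne Twister), and the lookup tables, the CSV-style record layout, and the function names make_record and generate_records are made up for the example. It shows a seeded generator, random indices mapped into tables to build structured records, and the generator state being saved so that a later run can pick up where the first one stopped.

```python
import pickle
import random

# Hypothetical lookup tables; a real test would use much larger ones
# to keep the collision rate (repeated names, cities) realistic.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
CITIES = ["Oslo", "Lima", "Pune", "Cairo", "Perth"]


def make_record(rng):
    """Map random indices to table entries to build one CSV-style record."""
    name = FIRST_NAMES[rng.randrange(len(FIRST_NAMES))]
    city = CITIES[rng.randrange(len(CITIES))]
    age = rng.randint(18, 90)
    return f"{name},{city},{age}"


def generate_records(rng, count):
    """Return `count` records drawn from `rng`, advancing its state."""
    return [make_record(rng) for _ in range(count)]


# Seeded generator: the same seed reproduces the same data.
rng = random.Random(42)
first_batch = generate_records(rng, 3)

# Record the generator state so a later run can resume from here
# without regenerating or storing everything produced so far.
saved_state = pickle.dumps(rng.getstate())
second_batch = generate_records(rng, 3)

# Resume from the saved state in a fresh generator.
resumed = random.Random()
resumed.setstate(pickle.loads(saved_state))
resumed_batch = generate_records(resumed, 3)
```

Here resumed_batch matches second_batch, since both start from the same recorded state. Writing each batch to its own file, with a different seed or a different starting state per file, would be one way to get files that differ from one another.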
