Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 251585
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 11, 20262026-05-11T21:36:24+00:00 2026-05-11T21:36:24+00:00

For purposes of testing compression, I need to be able to create large files,

  • 0

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good – the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I’d like to keep the number of files at a manageable level, let’s say o(10).

For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:

Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();
}

With a large enough buffer, let’s say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does has many identical repeated blocks, which is unnatural. Also, because the Lorem Ispum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don’t know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.

  • 1 1 Answer
  • 1 View
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-11T21:36:24+00:00Added an answer on May 11, 2026 at 9:36 pm

    I think you might be looking for something like a Markov chain process to generate this data. It’s both stochastic (randomised), but also structured, in that it operates based on a finite state machine.

    Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see Properties of Markov chains section of the page.) Hopefully you should see how to design one, however – to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to “train” your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth going to the effort, if you need these enormous lengths of test data.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Ask A Question

Stats

  • Questions 134k
  • Answers 134k
  • Best Answers 0
  • User 1
  • Popular
  • Answers
  • Editorial Team

    How to approach applying for a job at a company ...

    • 7 Answers
  • Editorial Team

    How to handle personal stress caused by utterly incompetent and ...

    • 5 Answers
  • Editorial Team

    What is a programmer’s life like?

    • 5 Answers
  • Editorial Team
    Editorial Team added an answer socketReceive.Bind(new IPEndPoint(IPAddress.Any, Port)); By binding to all your IP addresses,… May 12, 2026 at 6:45 am
  • Editorial Team
    Editorial Team added an answer You haven't named your function getState.... it doesn't know where… May 12, 2026 at 6:45 am
  • Editorial Team
    Editorial Team added an answer I just analysed the following code in .NET Reflector: public… May 12, 2026 at 6:45 am

Related Questions

I've been using yuicompressor.jar on my test server for on-the-fly minimisation of changed JavaScript
I am writing unit tests with C#, NUnit and Rhino Mocks. Here are the
I'm trying to unit test a method that performs a fairly complex operation, but
I've recently become quite interested in identifying patterns for software scalability testing. Due to

Trending Tags

analytics british company computer developers django employee employer english facebook french google interview javascript language life php programmer programs salary

Top Members

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.