I am writing a program that generates random text based on the Markov model.

Question

0

Asked: June 10, 20262026-06-10T11:55:01+00:00 2026-06-10T11:55:01+00:00

I am writing a program that generates random text based on the Markov model.

0

I am writing a program that generates random text based on the Markov model. I am running into a problem, with some files that have a lot of spaces in between words, the initial seed is seen to be a space. The problem is that all the next characters are seen as spaces as well and so the random text that is generated is just a blank documents as nextChosenChar is always a space.

Can someone suggest some solution to this problem?

I tried to come up with a solution as seen the latter part of the code below, but to no avail.

char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {    
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info 
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);

    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
     if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-10T11:55:02+00:00

If you really want a character-based model, you won’t get very natural looking text as output, but it is definitely possible, and that model will fundamentally be able to deal with sequences of space characters as well. There is no need to remove them from the input if you consider them a natural part of the text.

What is important is that a Markov model does not always fall back to predicting the one character that has the highest probability at any given stage. Instead, it must look at the entire probability distribution of possible characters, and chooses one randomly.

Here, randomly means it picks a character not pre-determined by the programmer. Still, the random distribution is not the uniform distribution, i.e. not all characters are equally likely. It has to take into account the relative probabilities of the various possible characters. One way to do this is to generate a cumulative probability distribution of characters, i.e. for example, if the probabilities are

p('a') == 0.2
p('b') == 0.4
p('c') == 0.4

we represent them as

p('a') == 0.2
p('b') == p('a') + 0.4 == 0.6
p('c') == p('a') + p('b') == 1.0

Then to generate a random character, we first generate a uniformly distributed random number N between 0 and 1, and then choose the first character whose cumulative probability is no less than N.

I have implemented this in the example code below. The train() procedure generates a cumulative probability distribution of the following-characters, for every character in the training input. The ‘predict()’ procedure applies this to generate random text.

For a full implementation, this still lacks:

A representation of the probability distribution for the initial character. As you see in the ‘main()’ function, my output simply always starts with ‘t’.
A representation of the length of the output string, or the final character. ‘main()’ simply always generates a string of length 100.

The code was tested with GCC 4.7.0 (C++11 option) on Linux. Example output below.

#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>

template <typename Char>
class Markov
{
public:
  /* Data type used to count the frequencies (integer!) of
     characters. */
  typedef std::map<Char,unsigned>            CharDistributionMap;

  /* Data type used to represent a cumulative probability (float!)
     distribution. */
  typedef std::vector<std::pair<Char,float>> CharDistribution;

  /* Data type used to represent the Markov model. Each character is
     mapped to a probality distribution of the characters that follow
     it. */
  typedef std::map<Char,CharDistribution>    MarkovModel;


  /* The model. */
  MarkovModel  _model;

  /* Training procedure. */
  template <typename Iterator>
  void train(Iterator from, Iterator to)
  {
    _model = {};
    if (from == to)
      return;

    std::map<Char,CharDistributionMap> proto_model {};

    /* Count frequencies. */
    Char current = *from;
    while (true) {
      ++from;
      if (from == to)
        break;
      Char next = *from;
      proto_model[current][next] += 1;
      current = next;
    }

    /* Transform into probability distribution. */
    for (const auto &entry : proto_model) {
      const Char current              = entry.first;
      const CharDistributionMap &freq = entry.second;

      /* Calculate total frequency of current character. */
      unsigned total =
         std::accumulate(std::begin(freq),std::end(freq),0,
           [](unsigned res,const std::pair<Char,unsigned> &p){
                   return res += p.second;
               });

      /* Determine the probability distribution of characters that
         follow the current character. This is calculated as a cumulative
         probability. */
      CharDistribution dist {};
      float probability { 0.0 };
      std::for_each(std::begin(freq),std::end(freq),
             [total,&probability,&dist](const std::pair<Char,unsigned> &p){
                   // using '+=' to get cumulative probability:
                   probability += static_cast<float>(p.second) / total; 
                   dist.push_back(std::make_pair(p.first,probability));
             });

      /* Add probability distribution for current character to the model. */
      _model[current] = dist;
    }
  }


  /* Predict the next character, assuming that training has been
     performed. */
  template <typename RandomNumberGenerator>
  Char predict(RandomNumberGenerator &gen, const Char current)
  {
    static std::uniform_real_distribution<float> generator_dist { 0, 1 };

    /* Assume that the current character is known to the model. Otherwise,
       an std::out_of_range exception will be thrown. */
    const CharDistribution &dist { _model.at(current) };

    /* Generate random number between 0 and 1. */
    float random { generator_dist(gen) };

    /* Identify the character that has the greatest cumulative probabilty
       smaller than the random number generated. */
    auto res =
         std::lower_bound(std::begin(dist),std::end(dist),
                          std::make_pair(Char(),random),
             [](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
                    return (p1.second < p2.second);
             });
    if (res == std::end(dist))
      throw "Empty probability distribution. This should not happen.";
    return res->first;
  }

};

int main()
{
  /* Initialize random-number generator. */
  std::random_device rd;
  std::mt19937 gen(rd());


  std::string input { "this   is    some   input text   with   many spaces." };

  if (input.empty())
    return 1;

  /* We append the first character to the end, to ensure that even the
     last character of the text gets a non-empty probability
     distribution. A more proper way of dealing with character that
     have empty distributions would be _smoothing_. */
  input += input[0];

  Markov<char> markov {};
  markov.train(std::begin(input),std::end(input));

  /* We set the initial character. In a real stochastic model, there
     would have to be a separate probality distribution for initial
     character and we would choose the initial character randomly,
     too. */
  char current_char { 't' };

  for (unsigned i = 0 ; i < 100 ; ++i) {
    std::cout << current_char;
    current_char = markov.predict(gen,current_char);
  }
  std::cout << current_char << std::endl;
}

Some example output generated by this program:

t  mext s.t th   winy  iny  somaces      sputhis inpacexthispace te  iny            me   mext mexthis

tes    is  manputhis.th is  wis.th with it    is  is.t  s   t   winy    it mext    is        ispany

this  maces      somany  t    s        it this  winy sputhisomacext manput    somanputes  macexte iso

t   wispanpaces maces  tesomacexte s  s  mes.th     isput t wit   t   somanputes   s  withit  sput ma

As you can see, the distribution of space characters follows, sort of naturally, the distribution found in the input text.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am writing a program that generates random text based on the Markov model.

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply