I am writing a program that generates random text based on a Markov model. I am running into a problem: with some files that have a lot of spaces between words, the initial seed turns out to be a space. All the following characters are then chosen as spaces as well, so the generated text is just a blank document, because nextChosenChar is always a space.
Can someone suggest a solution to this problem?
I tried to come up with a solution, as seen in the latter part of the code below, but to no avail.
char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);
    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
    if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}
If you really want a character-based model, you won’t get very natural-looking text as output, but it is definitely possible, and such a model is fundamentally able to deal with sequences of space characters as well. There is no need to remove them from the input if you consider them a natural part of the text.
What is important is that a Markov model does not always fall back to predicting the one character that has the highest probability at any given stage. Instead, it must look at the entire probability distribution of possible characters and choose one at random.
Here, randomly means it picks a character not pre-determined by the programmer. Still, the random distribution is not the uniform distribution, i.e. not all characters are equally likely. It has to take into account the relative probabilities of the various possible characters. One way to do this is to generate a cumulative probability distribution of characters, i.e. for example, if the probabilities are
we represent them as
Then to generate a random character, we first generate a uniformly distributed random number N between 0 and 1, and then choose the first character whose cumulative probability is no less than N.
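A minimal sketch of this sampling scheme for an order-1 character model follows. It is an illustration under stated assumptions, not necessarily identical to the implementation described next: the names `train` and `predict` are taken from the text, while `Cumulative`, `Model`, and `pickChar` are hypothetical helpers introduced here. As a made-up example, followers with probabilities a: 0.5, b: 0.3, c: 0.2 would be stored as the cumulative values a: 0.5, b: 0.8, c: 1.0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

// (character, cumulative probability) pairs, in increasing cumulative order.
// Hypothetical type names, introduced for this sketch only.
using Cumulative = std::vector<std::pair<char, double>>;
using Model = std::map<char, Cumulative>;

// For every character in `text`, build the cumulative distribution of the
// characters that directly follow it (order-1 model).
Model train(const std::string& text) {
    std::map<char, std::map<char, std::size_t>> counts;
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        ++counts[text[i]][text[i + 1]];
    }
    Model model;
    for (const auto& entry : counts) {
        std::size_t total = 0;
        for (const auto& follower : entry.second) total += follower.second;
        Cumulative cumulative;
        std::size_t running = 0;
        for (const auto& follower : entry.second) {
            running += follower.second;
            cumulative.emplace_back(follower.first,
                                    static_cast<double>(running) / total);
        }
        model[entry.first] = cumulative;
    }
    return model;
}

// Choose the first character whose cumulative probability is no less than n.
char pickChar(const Cumulative& cumulative, double n) {
    assert(!cumulative.empty());
    for (const auto& entry : cumulative) {
        if (entry.second >= n) return entry.first;
    }
    return cumulative.back().first;  // defensive fallback for n > 1
}

// Generate at most `length` characters, starting with `seed`. Generation
// stops early if the current character has no recorded follower.
std::string predict(const Model& model, char seed, std::size_t length,
                    std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::string out(1, seed);
    char current = seed;
    while (out.size() < length) {
        auto it = model.find(current);
        if (it == model.end()) break;
        current = pickChar(it->second, uniform(rng));
        out += current;
    }
    return out;
}
```

Because `pickChar` takes the random number as a parameter, the selection rule can be checked in isolation; spaces are treated like any other character, so they are emitted only as often as the training text suggests.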
I have implemented this in the example code below. The train() procedure generates a cumulative probability distribution of the following characters for every character in the training input. The predict() procedure applies this to generate random text.
For a full implementation, this still lacks:
The code was tested with GCC 4.7.0 (C++11 option) on Linux.
Some example output generated by this program:
As you can see, the distribution of space characters follows, sort of naturally, the distribution found in the input text.