This paper contains confusion matrices for spelling errors in a noisy channel. It describes how to correct the errors based on conditional probabilities.
The conditional probability computation is on page 2, left column. In footnote 4, page 2, left column, the authors say: “The chars matrices can be easily replicated, and are therefore omitted from the appendix.” I cannot figure out how they can be replicated!
How can I replicate them? Do I need the original corpus, or did the authors mean they could be recomputed from the material in the paper itself?
Looking at the paper, you just need to calculate them using a corpus, either the same one or one relevant to your application.
In replicating the matrices, note that they implicitly define two different chars matrices: a vector and an n-by-n matrix. For each character x, the vector chars contains a count of the number of times the character x occurred in the corpus. For each character sequence xy, the matrix chars contains a count of the number of times that sequence occurred in the corpus. chars[x] represents a look-up of x in the vector; chars[x,y] represents a look-up of the sequence xy in the matrix. Note that chars[x] = the sum of chars[x,y] over all values of y.

Note that their counts are all based on the 1988 AP Newswire corpus (available from the LDC). If you can’t use their exact corpus, I don’t think it would be unreasonable to use another text from the same genre (i.e. another newswire corpus) and scale your counts so that they fit the original data. That is, the relative frequency of a given character shouldn’t vary too much from one text to another if they’re similar enough. So if you’ve got a corpus of 22 million words of newswire, you could count characters in that text and then double the counts to approximate the original ones (this assumes the original corpus is about twice that size).
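
The counting and scaling described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `count_chars` and `scale_counts` are made up here, and lowercasing the text while keeping every character (spaces and punctuation included) is my assumption; the paper may restrict the alphabet differently. Also note that the identity chars[x] = sum of chars[x,y] holds exactly for every character except the very last one in the corpus, which has no successor.

```python
from collections import Counter


def count_chars(text):
    """Return (vector, matrix) counts for a corpus string.

    vector[x]      = number of times character x occurs
    matrix[(x, y)] = number of times the sequence xy occurs
    Lowercasing is an assumption, not something the paper specifies here.
    """
    text = text.lower()
    vector = Counter(text)
    # Pair each character with its successor to get adjacent sequences.
    matrix = Counter(zip(text, text[1:]))
    return vector, matrix


def scale_counts(counts, factor):
    """Multiply every count by factor, e.g. 2 for a corpus half the size."""
    return {key: value * factor for key, value in counts.items()}


# Tiny stand-in corpus; a real run would read a newswire text instead.
sample = "the cat sat on the mat"
chars_vec, chars_mat = count_chars(sample)

# Look-ups corresponding to chars[x] and chars[x,y].
t_count = chars_vec["t"]
th_count = chars_mat[("t", "h")]

# Approximate counts for a corpus twice as large.
scaled_vec = scale_counts(chars_vec, 2)
```

For a large corpus you would probably read the file in chunks and update the two counters incrementally rather than loading everything into one string, taking care to count the pair that straddles each chunk boundary.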