I’m trying to preprocess a text file in which each line lists the bigrams of one document together with their frequencies in that document. Here is an example line:
i_like 1 you_know 2 …. not_good 1
I managed to create the dictionary from the whole corpus.
Now I want to read the corpus line by line and, using the dictionary, build the document-term matrix, so that element (i,j) of the matrix is the frequency of term “j” in document “i”.
Create a function that generates an integer index for each word using a dictionary:
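The original snippet for this step is not shown, so here is a minimal Python sketch of the idea. The names `get_index` and `word_to_index` are hypothetical, and assigning indices in order of first appearance is an assumption; any stable numbering would do.

```python
def get_index(word, word_to_index):
    """Return the integer index of `word`, assigning the next free one if it is new."""
    if word not in word_to_index:
        # Assumption: indices are handed out in order of first appearance.
        word_to_index[word] = len(word_to_index)
    return word_to_index[word]


word_to_index = {}
indices = [get_index(w, word_to_index)
           for w in ["i_like", "you_know", "i_like", "not_good"]]
print(indices)  # a repeated word gets the same index as before
```

Because the index of a term is also its column in the matrix, the dictionary should be complete before any rows are built.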
The resulting matrix, m_Matrix, is a list of rows, one row per document:
Processing each line of the text file generates one row of the matrix:
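As a rough illustration of this step, the sketch below turns one line into one dense row. It is written in Python with a hypothetical `process_line` helper, and it assumes tokens strictly alternate between term and count and that terms missing from the dictionary can be skipped.

```python
def process_line(line, word_to_index):
    """Turn 'term count term count ...' into one dense row of counts."""
    row = [0] * len(word_to_index)
    tokens = line.split()
    # Tokens alternate: term, count, term, count, ...
    for term, count in zip(tokens[0::2], tokens[1::2]):
        column = word_to_index.get(term)
        if column is not None:  # assumption: ignore terms not in the dictionary
            row[column] += int(count)
    return row


word_to_index = {"i_like": 0, "you_know": 1, "not_good": 2}
row = process_line("i_like 1 you_know 2 not_good 1", word_to_index)
print(row)
```

A dense row is the simplest choice. For a large vocabulary most entries will be zero, so a sparse representation may be preferable.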
You read the text file one line at a time, calling ProcessLine() on each line and adding the resulting list to m_Matrix.
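The reading loop could look roughly like the following Python sketch. `build_matrix` is a hypothetical name, `process_line` is a compact stand-in for ProcessLine(), and an in-memory `io.StringIO` takes the place of the real corpus file so the example runs on its own.

```python
import io


def process_line(line, word_to_index):
    """Compact stand-in for ProcessLine(): one dense row of counts per line."""
    row = [0] * len(word_to_index)
    tokens = line.split()
    for term, count in zip(tokens[0::2], tokens[1::2]):
        if term in word_to_index:
            row[word_to_index[term]] += int(count)
    return row


def build_matrix(lines, word_to_index):
    """Collect one row per non-empty line; `lines` can be any iterable of strings."""
    matrix = []
    for line in lines:  # a file object yields one line at a time
        if line.strip():
            matrix.append(process_line(line, word_to_index))
    return matrix


word_to_index = {"i_like": 0, "you_know": 1, "not_good": 2}
corpus = io.StringIO("i_like 1 you_know 2\nnot_good 1 i_like 3\n")
m_matrix = build_matrix(corpus, word_to_index)
print(m_matrix)
```

With a real corpus, an open file object can be passed in place of `corpus`. Iterating over it reads one line at a time, so the whole file never has to be held in memory.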