I’m trying to pre-process a text file, where each line is a bi-gram words

Question

Asked: June 5, 2026

I’m trying to pre-process a text file in which each line lists the bigrams of one document together with their frequencies in that document. Here is an example line:

i_like 1 you_know 2 …. not_good 1

I managed to create the dictionary from the whole corpus.
Now I want to read the corpus line by line and, using the dictionary, build the document-term matrix, so that element (i, j) of the matrix is the frequency of term “j” in document “i”.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T07:08:12+00:00

Create a function that generates an integer index for each word using a dictionary:

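// Requires: using System.Collections.Generic;
// Maps each word to its column index in the matrix.
Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();

int GetWordIndex(string word)
{
  int result;
  // A word seen for the first time gets the next free index.
  if (!m_WordIndexes.TryGetValue(word, out result)) {
    result = m_WordIndexes.Count;
    m_WordIndexes.Add(word, result);
  }
  return result;
}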
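Since the question says the dictionary for the whole corpus already exists, the column indexes could also be assigned up front instead of on first sight. The following is only a minimal sketch under that assumption; the class name VocabularyIndex and its Build method are hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;

static class VocabularyIndex
{
    // Assigns column indexes 0..n-1 to the terms in the order they are given.
    // A term that appears more than once keeps the index of its first occurrence.
    public static Dictionary<string, int> Build(IEnumerable<string> terms)
    {
        var indexes = new Dictionary<string, int>();
        foreach (string term in terms)
        {
            if (!indexes.ContainsKey(term))
            {
                indexes.Add(term, indexes.Count);
            }
        }
        return indexes;
    }
}
```

With the indexes fixed in advance, the number of columns is known before the first line is read, so every row can be allocated at its final width.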

The result matrix is:

List<List<int>> m_Matrix = new List<List<int>>();

Processing each line of the text file generates one row of the matrix:

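// Requires: using System; using System.Collections.Generic;
List<int> ProcessLine(string line)
{
  List<int> result = new List<int>();
  // Split on whitespace into alternating tokens: word, count, word, count, ...
  string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
  for (int i = 0; i + 1 < tokens.Length; i += 2) {
    string word = tokens[i];
    int numberOfOccurrences = int.Parse(tokens[i + 1]);
    int index = GetWordIndex(word);
    // Grow the row with zeros until position index exists.
    while (index >= result.Count) {
      result.Add(0);
    }
    // Assign in place; Insert() would shift the counts already stored.
    result[index] = numberOfOccurrences;
  }
  return result;
}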

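You read the text file one line at a time, calling ProcessLine() on each line and adding the resulting list to m_Matrix. Rows can differ in length because each one only extends to the highest word index it contains; treat any missing trailing entries as zero.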
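The whole procedure can be sketched end to end as follows. This is a minimal illustration, not the answerer's exact code: the class name DocumentTermMatrix is hypothetical, the input format ("term count term count ...", separated by whitespace) is assumed from the example line in the question, and the final padding step is an addition so that all rows end up the same width.

```csharp
using System;
using System.Collections.Generic;

static class DocumentTermMatrix
{
    // Builds one row per input line; column j holds the count of the term
    // that was assigned index j when it was first seen.
    public static int[][] Build(IEnumerable<string> lines)
    {
        var wordIndexes = new Dictionary<string, int>();
        var rows = new List<List<int>>();

        foreach (string line in lines)
        {
            var row = new List<int>();
            // A null separator array splits on any whitespace.
            string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i + 1 < tokens.Length; i += 2)
            {
                if (!wordIndexes.TryGetValue(tokens[i], out int column))
                {
                    column = wordIndexes.Count;
                    wordIndexes.Add(tokens[i], column);
                }
                // Grow the row with zeros until the column exists.
                while (row.Count <= column)
                {
                    row.Add(0);
                }
                row[column] += int.Parse(tokens[i + 1]);
            }
            rows.Add(row);
        }

        // Pad every row with trailing zeros up to the final vocabulary size.
        int width = wordIndexes.Count;
        var matrix = new int[rows.Count][];
        for (int r = 0; r < rows.Count; r++)
        {
            matrix[r] = new int[width];
            rows[r].CopyTo(matrix[r]);
        }
        return matrix;
    }
}
```

To run it over a file, something like DocumentTermMatrix.Build(System.IO.File.ReadLines("corpus.txt")) should work, where "corpus.txt" is a placeholder path. For a large vocabulary most entries will be zero, so a dense matrix like this can use a lot of memory.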
