As I understand it, IDF reflects how many documents contain a term (that's roughly the idea). You can calculate IDF (along with TF) on the training set, since you have all the documents beforehand. But what if I don’t have the test set beforehand and I’m getting test documents sequentially (for example, from a web crawler)? How do I calculate the IDF for the words in a document at test time?
In this situation, if your training set is large enough, you can compute IDF from the training set alone. At test time, if a term in the new document appears in the training set, use its training IDF. If the term is new, its document frequency in the training set is zero, so compute its IDF from the number of training documents alone.
For some purposes, smoothing methods can give better results.
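Here is a minimal sketch of that idea in plain Python, in case it helps. The function names `fit_idf` and `tfidf`, the toy documents, and the add-one smoothed formula `log((1 + N) / (1 + df)) + 1` are my own assumptions for illustration; other smoothing variants would work too.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute IDF from the training documents only.

    Returns the IDF table and the number of training documents N.
    """
    n_docs = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))  # count each term at most once per document
    # Assumed smoothing: add one to N and to df so no ratio divides by zero.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, n_docs


def tfidf(doc, idf, n_docs):
    """Weight one incoming test document using the frozen training IDF."""
    tf = Counter(doc)
    # A term never seen in training has df = 0, so its IDF depends only on N.
    unseen_idf = math.log((1 + n_docs) / 1) + 1
    return {t: (c / len(doc)) * idf.get(t, unseen_idf) for t, c in tf.items()}


# Toy, already-tokenized documents (hypothetical data).
train = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "play"]]
idf, n = fit_idf(train)

# A test document arriving later; "crawler" was never seen in training.
vec = tfidf(["cat", "crawler", "crawler"], idf, n)
```

The IDF table is fitted once and never updated, so each test document can be weighted as soon as it arrives. With this formula an unseen term gets the largest possible IDF, which may or may not be what you want for your task.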