We have a set of Documents , each has a set of Features.
Given feature A, we need to know what is the probability of having feature B in the same document.
I thought of building a probability matrix , s.t:
M(i,j) = Probability of having feature B in a document , given that feature A is there.
However , we have an additional requirement:
Given feature A is in the document , what are all the features that have a probability > P of being in the same document.
In the mean while all I could think off is a sparse matrix for the Probability matrix , and after it’s computed , for each feature run over all the column , sort it by P , and keep it in a linked list somewhere. (So now , we have for each feature , a list of corresponding features
This space complexity is quite big (worst case: N^2, and N is large!) , and the time complexity for each search is O(N).
Any better idea?
If the number of features is comparable with the number of documents, or greater, consider holding an inverted index: for each feature hold (e.g. a sorted list of) the documents in which it is present. You can then work out the probability of B given A by running a merge on the sorted lists for features A and B.
For the “common features expected given A” question, I can think of nothing better than pre-computing the answer for each A and hoping that the resulting list of features isn’t too long.