I have a set of text documents about Java. I need to identify the most important document (the one an expert would pick) automatically, using a computer.
For example, given 10 books on Java, the system should identify the Java Complete Reference as the most important or most relevant document (based on its similarity to the Wikipedia page about Java).
One method is to take a reference document and compute the similarity between it and each document in the set (as in the example above), then report the document with the highest similarity as the most important one.
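To make the reference-document approach concrete, here is a minimal sketch. It assumes bag-of-words term counts and cosine similarity as the similarity measure; the description above does not fix a particular measure, so this is only one possible choice. The class and method names (ReferenceSimilarity, termCounts, cosine, mostSimilar) and the sample texts are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ReferenceSimilarity {

    // Lowercase the text, split on non-alphanumeric runs, and count each term.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two term-count vectors (0.0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Return the name of the document most similar to the reference text.
    static String mostSimilar(String reference, Map<String, String> documents) {
        Map<String, Integer> ref = termCounts(reference);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            double score = cosine(ref, termCounts(doc.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = doc.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; real input would be the full document texts.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("bookA", "java classes objects and the java virtual machine");
        docs.put("bookB", "cooking pasta with tomato sauce");
        String reference = "java is an object oriented language running on the java virtual machine";
        System.out.println(mostSimilar(reference, docs));
    }
}
```

Raw term counts favor frequent common words, so in practice one would probably remove stop words or switch to TF-IDF weights before comparing.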
I would like to find other, more efficient methods of doing this.
Please suggest other methods for finding the most relevant document (in an unsupervised way if possible).
I think another approach would be to keep a dictionary of keywords, with a ranking for each keyword, associated with each document.
For example, for the Java Complete Reference book, there would be a dictionary of keywords and their rankings:
Java-10
J2ee-5
J2SDK-10
Java5-10, etc.
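A rough sketch of how such a keyword dictionary could be used for scoring follows. The keyword weights are the ones listed above; everything else is an assumption: the answer does not say how the rankings are combined, so this version simply sums the weight of every keyword occurrence in a document. The names KeywordRanking, exampleWeights and score are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class KeywordRanking {

    // Keyword weights from the example above, stored in lowercase.
    static Map<String, Integer> exampleWeights() {
        Map<String, Integer> weights = new HashMap<>();
        weights.put("java", 10);
        weights.put("j2ee", 5);
        weights.put("j2sdk", 10);
        weights.put("java5", 10);
        return weights;
    }

    // Assumed scoring rule: add the weight of each keyword occurrence.
    // Tokens that are not in the dictionary contribute 0.
    static int score(String text, Map<String, Integer> weights) {
        int total = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            total += weights.getOrDefault(token, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        String text = "Java and J2EE";
        System.out.println(score(text, exampleWeights()));
    }
}
```

A plain sum favors long documents, so dividing the score by the number of tokens may give a fairer comparison between documents of different lengths.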
Note: If your documents are dynamic streams and their names are also dynamic, I am not sure how to handle that.