I have two text documents and want to get the word matches between the two documents. The words can match anywhere – for instance, word#5 of doc1 can match word#3 and word#67 of doc2; and then word#23 of doc1 can again match word#3 and word#67 of doc2 – so I want all the matches. Also, aside from one-word matches I want to similarly get consecutive multiple (2-word, 3-word ….15-word etc) word matches between the two documents. How should I approach this in Java? I have been looking at regular expressions but am still not convinced on the exact approach.
Share
First, you need to split the document into bunches of n words (1 word, 2 words, 3 words, …, n-words) – those bunches are called n-grams. Refer here.
Secondly, create a Set of n-grams from document A. Then, for each n-gram from document B, check if it is in the set.