I have two very large (30,000+ documents) collections: one contains words extracted from a text file (collection name ‘word’) and the other contains words from a dictionary (collection name ‘dictionary’).
How can I get the words that exist in both collections?
(I’ve simplified the situation: documents inside the ‘word’ collection contain metadata about the words, so each word has to be a separate document.)
Copy both collections into a single collection (include a discriminator field if necessary so you can tell what kind of document you have in each instance).
Run map-reduce on that collection.
In Map, emit the word as the key and a value, say {instance:1, dict:0} or {instance:0, dict:1}, depending on whether the document being mapped is an instance or a dictionary entry. (You could add more fields here into the values as necessary.)
In Reduce, accumulate the scores (as usual).
Now do a query looking for instance > 0 and dict > 0, and you have all of the words that are in both.
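To make the steps concrete, here is a minimal sketch in plain JavaScript that runs the same map, reduce, and filter logic in memory, without a database. The sample documents, the `kind` discriminator field, and the `mapReduce` driver are all made up for illustration. In MongoDB's own map-reduce, the map function would instead call `emit(key, value)` with the current document bound to `this`.

```javascript
// Hypothetical sample data standing in for the merged collection.
// The discriminator field name `kind` is an assumption.
const combined = [
  { word: "apple", kind: "instance", line: 3 },
  { word: "apple", kind: "instance", line: 9 },
  { word: "apple", kind: "dict" },
  { word: "zebra", kind: "dict" },
  { word: "qwxz", kind: "instance", line: 4 },
];

// Map: produce one (key, value) pair per document, keyed by the word.
function map(doc) {
  const value = doc.kind === "instance"
    ? { instance: 1, dict: 0 }
    : { instance: 0, dict: 1 };
  return [doc.word, value];
}

// Reduce: sum the counters for one key. The result has the same shape
// as the input values, so reducing already-reduced values is safe.
function reduce(key, values) {
  const total = { instance: 0, dict: 0 };
  for (const v of values) {
    total.instance += v.instance;
    total.dict += v.dict;
  }
  return total;
}

// Tiny in-memory driver (illustrative only): group mapped values by key,
// then reduce each group into an { _id, value } result document.
function mapReduce(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const [key, value] = map(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const results = [];
  for (const [key, values] of groups) {
    results.push({ _id: key, value: reduce(key, values) });
  }
  return results;
}

const counts = mapReduce(combined);

// Final "query": keep words seen at least once in each source.
const inBoth = counts
  .filter((r) => r.value.instance > 0 && r.value.dict > 0)
  .map((r) => r._id);

console.log(inBoth); // only "apple" appears in both sources
```

Against a real map-reduce output collection, where each result has the form { _id, value }, the final filter would correspond to a query such as { "value.instance": { $gt: 0 }, "value.dict": { $gt: 0 } }.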