I have a large list of phrases (single words and multi-word phrases, some overlapping) and a lot of documents. In the end I only want to store, per document, the list of phrases (from the large phrase list) that occur in it, not the whole documents. What's an efficient way to achieve this (preferably in Python)?
Example:
phrase_list = ['cat', 'dog', 'tree', 'tree house']  # actually a few thousand, if not a million
# a dict of a few thousand documents with longer text
doc_dictionary = {'doc1':"""the cat sat under the tree""",
'doc2':"""the dog chased the cat""",
'doc3':"""the boy loves his tree house"""}
result_dict = {'doc1': ['cat','tree'], 'doc2': ['dog', 'cat'], 'doc3': ['tree house']}
Sounds like you need an indexer and search engine, like Lucene (which is written in Java). Perhaps PyLucene, which makes Lucene usable from Python, will be helpful.
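If a full search engine turns out to be more than you need, here is a minimal pure-Python sketch of a different technique (not Lucene): put the phrases in a set and look up word n-grams from each document, trying the longest candidate first. The function name `find_phrases` and the `max_words` parameter are made up for illustration, and the sketch assumes the phrases are lowercase with words separated by single spaces.

```python
import re

def find_phrases(text, phrases, max_words):
    """Return the phrases found in text, preferring the longest match at each position."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible n-gram starting at token i first.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                if candidate not in found:
                    found.append(candidate)
                i += n  # skip the words consumed by this match
                break
        else:
            i += 1  # no phrase starts at this token
    return found

phrase_list = ['cat', 'dog', 'tree', 'tree house']
phrase_set = set(phrase_list)  # set membership tests are O(1) on average
max_words = max(len(p.split()) for p in phrase_set)

doc_dictionary = {'doc1': """the cat sat under the tree""",
                  'doc2': """the dog chased the cat""",
                  'doc3': """the boy loves his tree house"""}

result_dict = {name: find_phrases(text, phrase_set, max_words)
               for name, text in doc_dictionary.items()}
```

Because the longest match wins, 'doc3' yields only 'tree house' and not 'tree', as in your expected result. The work per document grows with the number of tokens times `max_words` and does not depend on how many phrases are in the set, so it should cope with a very large phrase list as long as the set fits in memory.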