I am working on an app that needs plagiarism detection. Can the new Search API, or any other App Engine API (such as Prospective Search), be used to do this over millions of entities?
If not, which Python library would you recommend for this?
Specifically, I need to detect similarity between solutions submitted for course homework. They could be programs or plain text, but would usually not exceed a few paragraphs each.
I am aware of the winnowing algorithm (sequential hashing), but the problem here is searching millions of homework submissions, not just a few.
You can use the Fulltext Search API to search a corpus of documents; this is subject to the usual caveats of fulltext search: you can search on individual terms and on exact phrases, but there’s no ‘fuzziness’ built in – near matches won’t be returned (barring things like stemming, which treat ‘phrase’ and ‘phrased’ and ‘phrases’ as the same word).
Of course, plagiarism detection is a lot more complicated than just finding candidate documents. Your best option may be to use something like TF-IDF to find the most significant words in an input text, use the Fulltext Search API to find a set of candidate documents containing those words, and then do a side-by-side comparison in memory on the candidates.
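As a rough illustration of the first and last steps, here is a minimal sketch in plain Python using only the standard library. The function names (`tokenize`, `top_terms`, `similarity`, `rank_candidates`), the smoothed IDF formula, and the use of `difflib.SequenceMatcher` for the side-by-side comparison are all my own assumptions, not anything App Engine provides. The middle step, querying the search index with the selected terms, is left out because it depends on how your index is set up.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def top_terms(text, corpus, k=5):
    """Return the k terms of `text` with the highest TF-IDF weight.

    `corpus` is a sample of reference documents used only to estimate
    document frequencies (an assumption of this sketch).
    """
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    tf = Counter(tokenize(text))
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in doc_sets if term in terms)
        # Smoothed IDF so terms absent from the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def similarity(a, b):
    """Similarity ratio in [0, 1] between the token sequences of two texts."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def rank_candidates(submission, candidates):
    """Return (score, text) pairs sorted by descending similarity."""
    scored = [(similarity(submission, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

You would feed the output of `top_terms` into a search query, fetch the matching documents, and pass them to `rank_candidates`. The in-memory comparison only ever sees the handful of candidates the search returned, which is what keeps this workable over millions of submissions. For source code you would probably want a tokenizer that normalizes identifiers and whitespace before comparing.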