I have crawled some data using Nutch and managed to inject it into Elasticsearch. But I have one problem: if I inject the crawled data again, it creates duplicates. Is there any way to prevent this?
Has anyone managed to solve this or have any suggestions on how to solve it?
/Samus
If you index each crawled page/document under the same id in Elasticsearch, it won’t be duplicated: the new copy replaces the existing document with that id. You could use a checksum/hash function to turn the page’s URL into a stable, distinct ID.
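As a rough illustration of the hashing idea, here is a minimal Python sketch. The function name url_to_doc_id is made up for this example, and the choice of SHA-256 and of stripping surrounding whitespace are assumptions; any stable hash would work, as long as the same URL always produces the same id.

```python
import hashlib


def url_to_doc_id(url: str) -> str:
    """Map a page URL to a stable document id (hypothetical helper).

    Assumption: only surrounding whitespace is stripped before hashing.
    Real crawls may need stronger URL normalization.
    """
    cleaned = url.strip()
    # SHA-256 hex digest: 64 hex characters, identical for identical input.
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The same URL yields the same id on every crawl, so re-indexing it
    # targets the same document instead of creating a new one.
    print(url_to_doc_id("http://example.com/page"))
```

Because the id depends only on the URL, crawling the same page twice produces the same id both times.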
You can also set the index API's op_type parameter to create, so that an index request for an id that already exists fails instead of reindexing the document:
ElasticSearch index API
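To make the op_type suggestion concrete, here is a hedged Python sketch that only builds the HTTP request and does not send it. The helper name build_create_request, the host, and the index name are assumptions for illustration. The /_doc/ path segment is also an assumption that matches recent Elasticsearch releases; older releases with mapping types put the type name in that position.

```python
import json
import urllib.request
from urllib.parse import quote


def build_create_request(base_url, index, doc_id, document):
    """Build (but do not send) a PUT request that uses op_type=create.

    Hypothetical helper: base_url, index and the /_doc/ path layout are
    assumptions, adjust them to your cluster and Elasticsearch version.
    """
    url = "{}/{}/_doc/{}?op_type=create".format(
        base_url.rstrip("/"),
        quote(index, safe=""),
        quote(doc_id, safe=""),
    )
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_create_request(
        "http://localhost:9200", "pages", "abc123", {"title": "Example"}
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. With op_type=create, a second request for an id that is already indexed should be rejected with an HTTP 409 conflict, which urlopen raises as urllib.error.HTTPError, so the caller can catch it and skip that page.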