I have crawled some data using nutch and managed to inject it into elasticsearch.

Asked: May 31, 2026 (2026-05-31T07:37:17+00:00)

I have crawled some data using Nutch and managed to index it into Elasticsearch. But I have one problem: if I index the crawled data again, it creates duplicates. Is there any way to prevent this?

Has anyone managed to solve this or have any suggestions on how to solve it?

/Samus

1 Answer

Editorial Team · Answer 1 · 2026-05-31T07:37:18+00:00

If you index each crawled page under the same ID every time, Elasticsearch overwrites the existing document instead of creating a duplicate. You could use a checksum/hash function to turn the page's URL into a stable, distinct ID.
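As a rough illustration of the hashing idea, here is a minimal Python sketch. The function name `url_to_doc_id` and the choice of SHA-256 are assumptions made for this example; any stable hash would work, and this is not how Nutch itself derives IDs.

```python
import hashlib


def url_to_doc_id(url: str) -> str:
    """Return a stable hex string derived from a page URL.

    Hypothetical helper: the same URL always yields the same ID, so
    re-indexing a page targets the same document instead of a new one.
    """
    # Strip surrounding whitespace so trivially different inputs match.
    normalized = url.strip()
    # SHA-256 is an illustrative choice; the digest is 64 hex characters.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


first_id = url_to_doc_id("http://example.com/page")
second_id = url_to_doc_id("http://example.com/page ")
other_id = url_to_doc_id("http://example.com/other")
```

Here `first_id` and `second_id` are identical, while `other_id` differs, so a second crawl of the same URL maps to the same document ID.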

You can also set op_type to create so that a document whose ID is already indexed is not reindexed:

The index operation also accepts an op_type that can be used to force
a create operation, allowing for “put-if-absent” behavior. When create
is used, the index operation will fail if a document by that id
already exists in the index.

ElasticSearch index API
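A sketch of the put-if-absent behavior over the REST API, using only the Python standard library. It assumes a recent Elasticsearch version where documents live under `/<index>/_doc/<id>` (older versions put a type name in the path) and that a conflicting create is answered with HTTP 409. The function names, the base URL, and the index name are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_create_request(base_url, index, doc_id, document):
    """Build a PUT request that asks Elasticsearch to create, not overwrite."""
    # Assumed URL layout: /<index>/_doc/<id>?op_type=create
    path = "/{}/_doc/{}".format(
        urllib.parse.quote(index, safe=""),
        urllib.parse.quote(doc_id, safe=""),
    )
    url = base_url.rstrip("/") + path + "?op_type=create"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def index_if_absent(base_url, index, doc_id, document):
    """Return True if the document was created, False if the ID already existed."""
    request = build_create_request(base_url, index, doc_id, document)
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.HTTPError as err:
        # Assumption: 409 Conflict signals that the ID is already indexed.
        if err.code == 409:
            return False
        raise


# Building the request does not contact any server.
example = build_create_request(
    "http://localhost:9200/", "pages", "abc123", {"title": "Example"}
)
```

Calling `index_if_absent` twice with the same ID against a running cluster should create the document once and then report that it already exists. Note the difference from plain indexing with a fixed ID: a plain index call replaces the stored document with the fresh crawl, whereas create keeps the first version.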
