Here is a situation I have encountered: I have two similar Java applications running on different servers. Both applications obtain data from the same website through the web service it provides. The site, of course, doesn’t know that the first app has already taken the same piece of data as the second app. After fetching, the data should be saved in a database. So I have the problem of saving the same data twice in the database.
How can I avoid duplicate entries in my db?
Probably there are two ways:
1) Use the database side: write something that looks like “insert if unique”.
2) Use the server side: write some intermediate service that receives the responses from the two data fetchers and processes them somehow.
I suppose the second solution is more efficient.
Can you advise something on this topic?
How would you implement that intermediate service? How would you implement communication between the services? If we used HashMaps to store the received data, how could we estimate the maximum size of HashMap that our system can handle?
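To make the second option concrete, here is a minimal sketch of the deduplication step I have in mind. It assumes every record carries a stable unique key; the class name `DedupService` and the method `accept` are made up for illustration, and how the fetchers would reach this service over the network is left open.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical intermediate service: both fetchers submit record keys here,
// and only the first submission of each key is passed on for saving.
class DedupService {
    // Thread-safe set of keys that have already been accepted.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    /** Returns true if the key is new (save the record), false if it is a duplicate. */
    boolean accept(String recordKey) {
        // add() is atomic on this set: exactly one caller gets true per key.
        return seenKeys.add(recordKey);
    }

    /** Number of distinct keys currently held in memory. */
    int size() {
        return seenKeys.size();
    }
}
```

The set grows by one entry per distinct key and is never trimmed in this sketch, which is why the question about the maximum size matters.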
Do you really need to fetch data on two servers simultaneously? Checking on every insert whether the entry is already present could be expensive. Merging several fetches can be time-consuming as well. Is there any benefit to fetching in parallel? Consider having one fetcher at a time.
The problem you will then face is that you have to choose which one of your distributed processes should fetch the data and store it in the DB.
This is a kind of leader election problem.
Take a look at Apache ZooKeeper, which is a distributed coordination service.
There is a recipe for implementing leader election with ZooKeeper.
There are a lot of frameworks that have already implemented this recipe. I’d recommend Netflix Curator. More details about leader election with Curator are available on the wiki.
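The idea behind the recipe, roughly: every process registers an ephemeral sequential node, and the process holding the lowest sequence number is the leader; when its node disappears, the next lowest takes over. The sketch below only simulates that rule in memory so the logic is easy to see. It makes no ZooKeeper or Curator calls, and the class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for the election path; NOT the ZooKeeper or Curator API.
class ElectionSimulation {
    private final AtomicLong nextSequence = new AtomicLong();
    // sequence number -> participant name, kept sorted by sequence number
    private final ConcurrentSkipListMap<Long, String> members = new ConcurrentSkipListMap<>();

    /** Registers a participant and returns the sequence number assigned to it. */
    long join(String participant) {
        long sequence = nextSequence.getAndIncrement();
        members.put(sequence, participant);
        return sequence;
    }

    /** Removes a participant, as would happen when its session ends. */
    void leave(long sequence) {
        members.remove(sequence);
    }

    /** The participant with the lowest sequence number is the leader, if anyone has joined. */
    Optional<String> leader() {
        Map.Entry<Long, String> first = members.firstEntry();
        return first == null ? Optional.empty() : Optional.of(first.getValue());
    }
}
```

In your case only the process that is currently the leader would fetch from the website and write to the database; the other one stays idle until it becomes the leader.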