I’ve been arguing with my programmer about the best way of going about this. We have data that comes in at a rate of about 10,000 objects per second. This needs to be processed asynchronously, but loose ordering is sufficient, so each object is inserted round-robin into one of several message queues (there are also several producers and consumers). Each object is ~300 bytes. And it needs to be durable, so the MQs are configured to persist to disk.
The problem is that these objects are often duplicated (as in, they are unavoidably duplicated in the data that comes in to the producer). They do have 10-byte unique ids. It’s not catastrophic if objects are duplicated in the queue, but it is if they’re duplicated in the processing after being taken from the queue. What’s the best way to get as close as possible to linear scalability whilst ensuring there’s no duplication in the processing of the objects? And perhaps linked to that, should the whole object be stored in the message queue, or only the id, with the body stored in something like Cassandra?
Thank you!
Edit: Confirmed where the duplication occurs. Also, so far I’ve had 2 recommendations for Redis. I’d previously been considering RabbitMQ. What are the pros and cons of each with regard to my requirements?
Without knowing how the messages are created within the system, what mechanism the producer uses to publish to the queue, and which queue system is in use, it’s difficult to diagnose what’s going on.
I’ve seen this scenario happen in a number of different ways: timed-out workers causing the message to become visible again in the queue (and thus processed a second time; this is common with Kestrel), misconfigured brokers (HA ActiveMQ comes to mind), misconfigured clients (Spring plus Camel routing comes to mind), clients double-submitting, and so on. There are many ways this kind of issue can come up.
Since I can’t really diagnose the issue, I’ll plug Redis here. You could easily combine something like SPOP (which is O(1), as is SADD for a single member) with pub/sub for an incredibly fast, constant-time, duplicate-free queue (a set can only contain unique elements). Although it’s a Ruby project, Resque may be able to help. It’s at least worth looking at.
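To make the set-as-queue idea concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the names DedupQueue, InMemorySetStore, drain and the key "pending_ids" are hypothetical, and the in-memory store is a stand-in that mimics SADD and SPOP so the snippet runs without a server. A redis-py client (redis.Redis) exposes sadd and spop methods with the same shape, so it could be passed in instead; note that it returns bytes unless created with decode_responses=True. The pub/sub notification part is left out.

```python
class InMemorySetStore:
    """Hypothetical stand-in that mimics the Redis SADD and SPOP commands."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        # Like SADD: returns how many members were newly added to the set.
        target = self._sets.setdefault(key, set())
        added = 0
        for member in members:
            if member not in target:
                target.add(member)
                added += 1
        return added

    def spop(self, key):
        # Like SPOP: removes and returns an arbitrary member, or None if empty.
        target = self._sets.get(key)
        if not target:
            return None
        return target.pop()


class DedupQueue:
    """Queue of object ids backed by a set, so pending ids are never duplicated."""

    def __init__(self, store, key="pending_ids"):
        self.store = store  # anything with sadd/spop, e.g. a redis-py client
        self.key = key

    def enqueue(self, object_id):
        # True if the id was newly queued, False if it was already pending.
        return self.store.sadd(self.key, object_id) == 1

    def dequeue(self):
        # Returns one pending id (in no particular order), or None if empty.
        return self.store.spop(self.key)


def drain(queue):
    """Pop ids until the queue is empty and return them as a list."""
    items = []
    while True:
        item = queue.dequeue()
        if item is None:
            return items
        items.append(item)


if __name__ == "__main__":
    q = DedupQueue(InMemorySetStore())
    for object_id in ["id-0000001", "id-0000002", "id-0000001"]:
        print(object_id, "queued" if q.enqueue(object_id) else "duplicate, skipped")
    print(len(drain(q)), "ids to process")
```

Two caveats with this approach: SPOP returns members in arbitrary order (which fits the "loose ordering" requirement), and the set only removes duplicates among ids that are still pending. An id that has already been popped and then arrives again would be queued a second time, so a separate record of processed ids would be needed to catch that case.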
Best of luck.