Basically, my consumers are producers as well. We get an initial dataset and send it to the queue. A consumer takes an item and processes it; from that point there are 3 possibilities:
- Data is good and gets put into a ‘good’ queue for storage
- Data is bad and discarded
- Data is not good (yet) or bad (yet), so it is broken down into smaller parts and sent back to the queue for further processing.
My problem is with the third case. Because the queue grows very quickly at first, it’s possible that a piece of data is broken down into a part that is already in the queue; the consumers keep processing it and end up in an infinite loop.
I think the way to guard against this is to prevent duplicates from going into the queue. I can’t do this on the client side because, over the course of an hour, I may have many cores dealing with billions of data points (having each client scan the queue before submitting would slow me down too much). I think this needs to be done on the server side but, like I mentioned, the data is quite large and I don’t know how to efficiently ensure there are no duplicates.
I might be asking the impossible but thought I’d give it a shot. Any ideas would be greatly appreciated.
The core problem seems to be this: the consumers keep processing re-queued pieces and end up in an infinite loop.
You can focus on uniqueness of your queued items all you want, but the issue above is where you should focus your efforts, IMO. One way to prevent infinite looping might be to have a “visited” bit in your message payload that is set by consumers before they re-queue the broken-down item.
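To make the “visited” bit idea concrete, here is a minimal sketch in Python. It uses an in-process `deque` as a stand-in for the real queue, and `classify`, `split`, and the message layout (`data` plus `visited`) are all hypothetical, invented only for illustration. The assumption is that an item which is still undecided after it has already been broken down once is set aside instead of being split again; if you need more than one level of breakdown, a small counter in place of the single bit would play the same role.

```python
from collections import deque


def classify(data: str) -> str:
    """Hypothetical rule: empty is bad, all digits is good, else undecided."""
    if data == "":
        return "bad"
    if data.isdigit():
        return "good"
    return "undecided"


def split(data: str) -> tuple[str, str]:
    """Hypothetical breakdown: cut the string in half.

    For a one-character string this returns ("", data), so without a
    guard the same item would be re-queued forever.
    """
    mid = len(data) // 2
    return data[:mid], data[mid:]


def run_with_visited_flag(initial: list[str]) -> tuple[list[str], list[str]]:
    # Each message carries its payload plus a "visited" bit.
    queue = deque({"data": d, "visited": False} for d in initial)
    good: list[str] = []
    unresolved: list[str] = []

    while queue:
        msg = queue.popleft()
        verdict = classify(msg["data"])
        if verdict == "good":
            good.append(msg["data"])
        elif verdict == "bad":
            continue  # discard
        elif msg["visited"]:
            # Already the product of a breakdown: do not split again.
            unresolved.append(msg["data"])
        else:
            # Set the bit before re-queueing the broken-down parts.
            for part in split(msg["data"]):
                queue.append({"data": part, "visited": True})

    return good, unresolved
```

Calling `run_with_visited_flag(["12ab"])` terminates even though `split` can hand back an item identical to its input, because a visited item is never broken down a second time.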
Another option would be to have the consumers re-queue back to a special queue that is treated slightly differently to prevent infinite looping. Either way, you should attack the issue by dealing with it as a core part of your application’s strategy rather than using a feature of a messaging system to step around it.
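The special-queue variant could look something like the sketch below. What “treated slightly differently” means is left open above, so this is only one possible reading: items taken from the re-queue are checked against a set of SHA-256 fingerprints and skipped if they were already processed. The two in-process `deque`s stand in for real queues, and `classify`, `split`, and `fingerprint` are hypothetical helpers. With billions of items the `seen` set would have to live in some shared store, which this sketch does not attempt to show.

```python
import hashlib
from collections import deque


def classify(data: str) -> str:
    """Hypothetical rule: empty is bad, all digits is good, else undecided."""
    if data == "":
        return "bad"
    if data.isdigit():
        return "good"
    return "undecided"


def split(data: str) -> tuple[str, str]:
    """Hypothetical breakdown: cut the string in half."""
    mid = len(data) // 2
    return data[:mid], data[mid:]


def fingerprint(data: str) -> bytes:
    """Fixed-size digest, so the seen-set does not store full payloads."""
    return hashlib.sha256(data.encode("utf-8")).digest()


def run_with_requeue_queue(initial: list[str]) -> tuple[list[str], int]:
    main = deque(initial)          # initial dataset
    requeue: deque[str] = deque()  # broken-down parts only
    seen: set[bytes] = set()       # fingerprints of re-queued items handled so far
    good: list[str] = []
    skipped = 0

    while main or requeue:
        if main:
            data = main.popleft()
        else:
            data = requeue.popleft()
            # Special treatment: a re-queued item is processed at most once.
            fp = fingerprint(data)
            if fp in seen:
                skipped += 1
                continue
            seen.add(fp)

        verdict = classify(data)
        if verdict == "good":
            good.append(data)
        elif verdict == "undecided":
            requeue.extend(split(data))
        # "bad" items are simply discarded

    return good, skipped
```

The design choice here is that the duplicate check is paid only on the re-queue path, so the initial load into the main queue is not slowed down by it.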