I have a Java EE 6 / EJB 3.1 / Glassfish 3.1.2 application that retrieves .xml pages from a remote computer, converts them to Java objects, then persists each of them in my MySQL database. There are tens of thousands of these .xml pages and I’m just adding them incrementally.
This is working great, except that it is very slow (70ms page retrieval + a tiny amount of time to convert & persist the entities).
I want to carry out this work concurrently to speed it up – what is the best method?
Possibly worth noting: each page retrieval updates a count in the MySQL database for the OAuth credential it uses to get the page, and if the count is at the max, it doesn’t continue (it throws an exception). I’m not sure if / how much this will complicate matters – but if two threads both see that the count is below the max and then get the page before updating the count, it could go over the max.
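The race described above is a check-then-act problem, and the usual remedy is to make the check and the increment a single atomic step that happens before the page is fetched. Below is a minimal sketch of that idea, using an in-memory AtomicInteger as a stand-in for the MySQL count. The class and method names (QuotaCounter, tryReserve) are hypothetical and not part of the application described here.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory stand-in for the per-credential request count kept in MySQL (hypothetical). */
public class QuotaCounter {
    private final int max;
    private final AtomicInteger used = new AtomicInteger();

    public QuotaCounter(int max) {
        this.max = max;
    }

    /**
     * Reserves one request if the count is still below the max.
     * The check and the increment happen as one atomic compare-and-set,
     * so two threads can never both take the last remaining slot.
     */
    public boolean tryReserve() {
        while (true) {
            int current = used.get();
            if (current >= max) {
                return false;               // quota exhausted, caller must not fetch
            }
            if (used.compareAndSet(current, current + 1)) {
                return true;                // slot reserved, caller may fetch the page
            }
            // another thread changed the count in between: re-read and retry
        }
    }

    public int used() {
        return used.get();
    }
}
```

With the count stored in the database, the same effect can probably be had with one conditional statement such as UPDATE credential SET used = used + 1 WHERE id = ? AND used < max_allowed (table and column names are assumptions), treating an affected-row count of 0 as "quota exhausted".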
My research so far has narrowed it down to two possibilities (feel free to add others though):
- Message Driven Beans – I imagine, though I am probably wrong, that I would have a session bean sending URL messages until the message queue is full (say 10 URLs are added), then blocking until the queue is not full. Glassfish would create 10 instances of a message bean I write, each of which gets the .xml from one of the URLs, updates the OAuth count, then sends this .xml as a message to another queue, where another message bean converts & persists the .xmls.
- Use @Asynchronous methods and create my own thread-safe queues? This could be much simpler and more suited to what I am doing, but I’m not sure exactly how I would implement it.
Any advice would be appreciated!
Since you’re dealing with a remote server, do you know how well it’s going to scale? If you hit it with 10 threads will your response time become 700ms or will it stay steady at 70ms?
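To put numbers on that question: with a fixed number of requests always in flight, throughput is roughly the thread count divided by the per-request latency. The small sketch below (class and method names are made up for illustration) shows that 10 threads at a steady 70 ms give about 143 pages per second, while 10 threads at 700 ms give about 14 pages per second, which is no better than a single thread at 70 ms.

```java
public class ThroughputEstimate {

    /** Pages per second when `threads` requests are always in flight, each taking latencyMs. */
    public static double pagesPerSecond(int threads, double latencyMs) {
        return threads * 1000.0 / latencyMs;
    }

    public static void main(String[] args) {
        // Single-threaded baseline from the question: one request every 70 ms.
        System.out.println("1 thread,   70 ms: " + pagesPerSecond(1, 70));
        // Remote server scales: latency stays at 70 ms under 10 concurrent requests.
        System.out.println("10 threads, 70 ms: " + pagesPerSecond(10, 70));
        // Remote server does not scale: latency grows to 700 ms, so nothing is gained.
        System.out.println("10 threads, 700 ms: " + pagesPerSecond(10, 700));
    }
}
```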
Assuming that the remote server will scale, I think you’re spot on with the MDB idea. However, some of your thoughts about it are inaccurate. You would create the session bean that submits to the queue. Where we differ is that I think you would want to load the queue as quickly as the work is available. You can set a queue size and tell it whether you want it to throw away the oldest or the newest messages once that size is reached. I suspect you want to consume all messages, and you can do that as well. I run queues with hundreds of thousands of messages in them. You’re really just limited by the in-memory size of the queue, which you can manage by keeping your messages as small as possible.
On the consumption side, you would restrict the MDB pool to 10 beans or whatever; it just depends on what the remote server is capable of scaling to and what your server is capable of scaling to. Rather than using 2 queues (and this is just based on the problem you described) I would use just the one. Create an MDB that does everything you’re doing now, i.e. grabbing the xml and persisting it. Lastly, if you find that you need to scale, it’s just a matter of creating a cluster and adding nodes. Each node will then have its own MDB pool to work with.
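The shape of this design (load one queue as fast as work is available, and let a fixed-size pool of consumers each fetch and persist a page) can be illustrated outside the container with plain java.util.concurrent. This is only an analogy, not an MDB: in Glassfish the container owns the threads and the pool size is configured on the bean rather than in code. The names PagePipelineSketch, PageHandler and processAll are hypothetical, and the sketch assumes Java 8 or later for the lambda.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PagePipelineSketch {

    /** Hypothetical stand-in for "fetch the .xml, convert it, persist it". */
    public interface PageHandler {
        void handle(String url) throws Exception;
    }

    /**
     * Queues every URL immediately and processes them with at most
     * poolSize handlers running at once. Returns how many succeeded.
     */
    public static int processAll(List<String> urls, int poolSize, PageHandler handler) {
        // A fixed pool plays the role of the restricted MDB pool; its internal
        // work queue is unbounded, so the producer loop below never blocks.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger succeeded = new AtomicInteger();

        for (String url : urls) {
            pool.execute(() -> {
                try {
                    handler.handle(url);   // one consumer does the whole job
                    succeeded.incrementAndGet();
                } catch (Exception e) {
                    // With a real MDB the container would decide about redelivery;
                    // this sketch simply skips the failed URL.
                }
            });
        }

        pool.shutdown();                   // accept no new work, finish what is queued
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return succeeded.get();
    }
}
```

A call such as processAll(urls, 10, url -> fetchAndPersist(url)), where fetchAndPersist is whatever the existing single-threaded code does today, would mirror a pool of 10 beans reading from one queue.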
With regards to @Asynchronous, how are you going to control pool size and all the other things MDBs give you? I’m not saying you couldn’t, but it seems you’d be reinventing the wheel.