I am developing a stand-alone Java application which gathers data from around 1000 measuring devices over the network and persists the data to a database.
The data gathering can take a couple of minutes per device due to slow device output and/or network speed. It must take place within a certain time window, so I need to work in parallel.
My approach would be to create one thread per measuring device, put the data in a queue and have one or more other threads at the other end of the queue transform and persist the data.
Is this a viable approach? Will a modern machine be able to handle that many threads and network connections? How scalable is this, and at what point would I need to spread the work over several machines?
I would also be grateful for pointers on which concurrency classes you would recommend (e.g. what kind of queue, ThreadPoolExecutor etc.). I haven’t used java.util.concurrent yet; the book is in the mail.
Are there any better approaches?
UPDATE:
Thanks for the answers so far, here is more information requested by some of you.
The data I receive from the devices is in the form of files smaller than 1 KB. It is possible that I get something like 25,000 files during one transfer, although usually it’s far fewer.
The data transformation is not CPU-intensive: basically parsing the file and converting it to Java data types (the file contains C data types like unsigned char and Unix timestamps), plus a CRC calculation. I create an object containing the content of one file, which I persist to the database using JPA (I guess I could use plain JDBC as well for this case). The measurement files do not need to be processed in any particular order, since each one contains the device s/n and a timestamp.
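For illustration, the conversion described above could look roughly like this. It is only a sketch under loudly stated assumptions: the 9-byte layout (one unsigned char, a 4-byte little-endian Unix timestamp, a trailing CRC-32), the class name `MeasurementParser` and the field names are all made up, and the devices may well use a different CRC variant. It needs Java 8 or later for `Byte.toUnsignedInt`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.zip.CRC32;

public class MeasurementParser {

    /** Parsed content of one file; the field names are invented for this sketch. */
    public static final class Measurement {
        public final int status;         // C unsigned char, 0..255
        public final Instant timestamp;  // C Unix timestamp in seconds

        Measurement(int status, Instant timestamp) {
            this.status = status;
            this.timestamp = timestamp;
        }
    }

    // Hypothetical layout:
    // [1 byte status][4 byte timestamp][4 byte CRC-32 over the first 5 bytes]
    private static final int PAYLOAD_LENGTH = 5;
    private static final int FILE_LENGTH = PAYLOAD_LENGTH + 4;

    public static Measurement parse(byte[] file) {
        if (file.length != FILE_LENGTH) {
            throw new IllegalArgumentException("unexpected length " + file.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);

        // Java has no unsigned byte or unsigned int, so widen to the next larger type.
        int status = Byte.toUnsignedInt(buf.get());
        long seconds = Integer.toUnsignedLong(buf.getInt());
        long storedCrc = Integer.toUnsignedLong(buf.getInt());

        // Recompute the checksum over the payload and compare it with the stored one.
        CRC32 crc = new CRC32();
        crc.update(file, 0, PAYLOAD_LENGTH);
        if (crc.getValue() != storedCrc) {
            throw new IllegalArgumentException("CRC mismatch");
        }
        return new Measurement(status, Instant.ofEpochSecond(seconds));
    }
}
```

The point is mainly the widening: an unsigned char read into a Java `byte` turns negative above 127, so it has to be converted to an `int` before use.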
At a later point in time I will have to add some kind of alert when certain criteria are met, but again this shouldn’t be CPU-intensive.
From the answers so far I gather that the network connections and the number of threads shouldn’t be a problem.
The only thing I’m left wondering about is the approach with the queue. An alternative would be to let the data-gathering threads also call the DAO method to persist the file. I guess I have to make the DAOs thread-safe anyway, but I think a few persisting threads could do the job as well, since the bulk of the time will be spent transferring data over the network.
I will also look into asynchronous I/O and some frameworks that provide it.
Thanks again, I will choose an answer a little bit later, maybe I will get some more input 🙂
With the default settings you’ll end up using around 1 GB of memory for the threads’ stacks (1000 threads × 1 MB), assuming you are running on 64-bit Linux with the Oracle JDK (the default thread stack size is 1 MB on that platform). I think it is the same for OpenJDK. That does not count buffers allocated by the OS.
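To make the arithmetic concrete, here is a small sketch. The class and method names (`StackBudget`, `reservedStackBytes`, `newSmallStackThread`) are made up for illustration. The four-argument `Thread` constructor is part of the JDK, but its documentation says the `stackSize` argument is only a suggestion that a JVM is free to ignore, so treat the smaller stack as something to measure rather than rely on.

```java
public class StackBudget {

    /** Upper bound on stack memory for a given number of threads. */
    static long reservedStackBytes(int threads, long stackSizeBytes) {
        return threads * stackSizeBytes;
    }

    /**
     * Creates a thread with a requested stack size.
     * The JVM may round the value or ignore it entirely.
     */
    static Thread newSmallStackThread(Runnable task, String name, long stackSizeBytes) {
        return new Thread(null, task, name, stackSizeBytes);
    }

    public static void main(String[] args) throws InterruptedException {
        long kib = 1024L;
        long mib = 1024L * kib;

        // 1000 device threads at the 1 MB default mentioned above
        System.out.println(reservedStackBytes(1000, mib) / mib + " MiB with 1 MiB stacks");
        // the same 1000 threads if each asks for only 256 KiB
        System.out.println(reservedStackBytes(1000, 256 * kib) / mib + " MiB with 256 KiB stacks");

        Thread t = newSmallStackThread(
                () -> System.out.println("running on " + Thread.currentThread().getName()),
                "device-0", 256 * kib);
        t.start();
        t.join();
    }
}
```

The same default can be changed for every thread at once with the `-Xss` JVM option (for example `-Xss256k`). As far as I know the figure is address space set aside for stacks; how much of it actually becomes resident depends on the OS and on how deep the call stacks get.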
If this is too much for your requirements, you may want to have a look at http://netty.io. This framework uses Java NIO under the hood (it can also be configured to use blocking I/O, btw). This way you would need only a handful of threads for the actual I/O (performing the read/write operations for a given TCP connection). Your business logic (updating the DB, calculating measurements) should then be offloaded to a separate thread pool. Netty also includes support for this.
If you want to use one thread per connection (i.e. per measuring device), then there is probably no benefit in having yet another bunch of threads doing the actual business work. I assume one thread per device, because you said that the device and/or the network can be slow. Neither bottleneck (network or device) will be eliminated by adding more threads; if anything, one can expect the opposite.
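A minimal sketch of that variant, where the task that talks to a device also persists what it received. `DeviceClient`, `MeasurementDao` and `gatherAll` are hypothetical names standing in for your own transfer and DAO code, and the DAO is assumed to be thread-safe.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerDeviceGatherer {

    /** Hypothetical: downloads all files of one device. */
    public interface DeviceClient {
        List<byte[]> fetchFiles(String deviceId) throws Exception;
    }

    /** Hypothetical: must be safe to call from several threads. */
    public interface MeasurementDao {
        void persist(String deviceId, byte[] file);
    }

    /** Runs one task per device and returns how many devices failed. */
    public static int gatherAll(List<String> deviceIds, DeviceClient client,
                                MeasurementDao dao, int threads,
                                long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failed = new AtomicInteger();

        for (String id : deviceIds) {
            pool.execute(() -> {
                try {
                    // the slow part: device output and network transfer
                    for (byte[] file : client.fetchFiles(id)) {
                        dao.persist(id, file); // the cheap part, done on the same thread
                    }
                } catch (Exception e) {
                    failed.incrementAndGet(); // one bad device must not stop the others
                }
            });
        }

        pool.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            if (!pool.awaitTermination(timeout, unit)) {
                pool.shutdownNow(); // time window is over, interrupt the stragglers
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return failed.get();
    }
}
```

With `threads` set to the number of devices this is one thread per device; a smaller value simply caps how many transfers run at the same time.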
Concurrency classes in general: java.util.concurrent.* gets both thumbs up from me.
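If you do go with the queue from the question, a rough sketch using only java.util.concurrent could look like this. `GatherPipeline`, the `fetch` and `persist` callbacks, the queue capacity and the one-hour timeouts are placeholders I made up; swap in your own transfer and DAO code and tune the numbers.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

public class GatherPipeline {

    // Marker that tells a persisting thread to stop; compared by identity.
    private static final byte[] POISON = new byte[0];

    public static void run(List<String> deviceIds,
                           Function<String, List<byte[]>> fetch,
                           Consumer<byte[]> persist,
                           int gatherThreads, int persistThreads) {
        // Bounded, so slow persisting applies back-pressure instead of eating memory.
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService gatherers = Executors.newFixedThreadPool(gatherThreads);
        ExecutorService persisters = Executors.newFixedThreadPool(persistThreads);
        try {
            for (int i = 0; i < persistThreads; i++) {
                persisters.execute(() -> {
                    try {
                        for (byte[] f = queue.take(); f != POISON; f = queue.take()) {
                            try {
                                persist.accept(f);
                            } catch (RuntimeException e) {
                                // log and continue; one bad file must not kill the consumer
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (String id : deviceIds) {
                gatherers.execute(() -> {
                    try {
                        for (byte[] f : fetch.apply(id)) {
                            queue.put(f); // blocks while the queue is full
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            gatherers.shutdown();
            gatherers.awaitTermination(1, TimeUnit.HOURS);
            // All producers are done: send one stop marker per persisting thread.
            for (int i = 0; i < persistThreads; i++) {
                queue.put(POISON);
            }
            persisters.shutdown();
            persisters.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            gatherers.shutdownNow();
            persisters.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```

A call like `GatherPipeline.run(ids, myFetch, myPersist, 1000, 4)` would give one gathering thread per device and four persisting threads. Whether the extra hop through the queue buys anything over persisting directly in the gathering task is something I would measure rather than assume.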