I want to get the community’s perspective on this. If I have a process which is heavily DB/IO bound, how smart would it be to parallelize individual process paths using the Task Parallel Library?
I’ll use an example. Say I have a bunch of items, and I need to do the following operations:
1. Query a DB for a list of items.
2. Do some aggregation operations to group certain items based on a dynamic list of parameters.
3. For each grouped result, query the database for something based on the aggregated result.
4. For each grouped result, do some numeric calculations (3 and 4 would happen sequentially).
5. Do some inserts and updates for the result calculated in #3.
6. Do some inserts and updates for each item returned in #1.
Logically speaking, I can parallelize into a graph of tasks at steps #3, #5, and #6, since one item has no bearing on the result of another. However, each of these will be waiting on the database (SQL Server), which is fine, and I understand that we can only process as fast as the SQL Server will let us.
But I want to logically distribute the work on the local machine so that it runs as fast as the database lets us, without having to wait for anything on our end. I’ve built a mock prototype where I substitute the DB calls with Thread.Sleep calls (I also tried some variations with .SpinWait, which was a million times faster), and the parallel version is way faster than the current implementation, which is completely serial.
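A mock prototype of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeDbCall`, the 50 ms delay, and the concurrency cap of 4 are all hypothetical and not taken from the real code. `Thread.Sleep` stands in for a blocking database round trip.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class MockPrototype
{
    // Hypothetical stand-in for one database round trip.
    public static int FakeDbCall(int groupId)
    {
        Thread.Sleep(50);      // blocks the thread the way a waiting I/O call would
        return groupId * 2;    // dummy result
    }

    // Current approach: one call after another.
    public static int[] RunSerial(int[] groups) =>
        groups.Select(FakeDbCall).ToArray();

    // Parallel approach, with an explicit cap on concurrent calls.
    public static int[] RunParallel(int[] groups, int maxConcurrency)
    {
        var results = new int[groups.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrency };
        // Each iteration writes to its own slot, so no locking is needed.
        Parallel.For(0, groups.Length, options, i => results[i] = FakeDbCall(groups[i]));
        return results;
    }

    public static void Main()
    {
        int[] groups = Enumerable.Range(1, 16).ToArray();

        var sw = Stopwatch.StartNew();
        RunSerial(groups);
        Console.WriteLine($"serial:   {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        RunParallel(groups, maxConcurrency: 4);
        Console.WriteLine($"parallel: {sw.ElapsedMilliseconds} ms");
    }
}
```

The `MaxDegreeOfParallelism` setting is the knob that limits how many simulated calls are in flight at once, which is the same number that would translate into concurrent load on the real server.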
What I’m afraid of is putting too much strain on the SQL Server. Is there anything I should consider before I go too far down this path?
Another option would be to create a pipeline so that step 3 for the second group happens at the same time as step 4 for the first group. And if you can overlap the updates at step 5, do that too. That way you’re doing concurrent SQL accesses and processing, but not over-taxing the database, because you only have two concurrent database operations going on at once.
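So you do steps 1 and 2 sequentially (I presume) to get a collection of groups that require further processing. Then your main thread starts running the step 3 queries, one group at a time, and puts each result on a results queue.
A second thread services the results queue: it does the step 4 calculations and puts each output on an update queue.
A third thread services the update queue: it performs the step 5 inserts and updates.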
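Putting the three threads together, a minimal sketch might look like the following. The helper methods `QueryForGroup`, `Calculate`, and `Update` are hypothetical placeholders for the real step 3, 4, and 5 work, and the bounded capacity of 10 is an arbitrary assumption, so treat this as an outline rather than a finished implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Hypothetical stand-ins for the real work in steps 3, 4 and 5.
    static int QueryForGroup(int group) => group * 10;                // step 3: per-group query
    static int Calculate(int queryResult) => queryResult + 1;         // step 4: numeric work
    static void Update(int value, List<int> sink) => sink.Add(value); // step 5: inserts/updates

    public static List<int> Run(IEnumerable<int> groups)
    {
        // Bounded queues keep a fast stage from running far ahead of a slow one.
        var resultsQueue = new BlockingCollection<int>(boundedCapacity: 10);
        var updateQueue = new BlockingCollection<int>(boundedCapacity: 10);
        var written = new List<int>();   // only touched by the update thread

        // Second thread: services the results queue (step 4).
        Task calcTask = Task.Factory.StartNew(() =>
        {
            foreach (int result in resultsQueue.GetConsumingEnumerable())
                updateQueue.Add(Calculate(result));
            updateQueue.CompleteAdding();   // signal that no more updates are coming
        }, TaskCreationOptions.LongRunning);

        // Third thread: services the update queue (step 5).
        Task updateTask = Task.Factory.StartNew(() =>
        {
            foreach (int value in updateQueue.GetConsumingEnumerable())
                Update(value, written);
        }, TaskCreationOptions.LongRunning);

        // Main thread: runs the step 3 queries and feeds the results queue.
        foreach (int group in groups)
            resultsQueue.Add(QueryForGroup(group));
        resultsQueue.CompleteAdding();      // signal that no more results are coming

        Task.WaitAll(calcTask, updateTask);
        return written;
    }

    public static void Main()
    {
        foreach (int v in Run(new[] { 1, 2, 3 }))
            Console.WriteLine(v);
    }
}
```

With one thread per stage, at most one query and one update are in flight at any moment, which is what keeps the load on the database low. `CompleteAdding` is what lets each `GetConsumingEnumerable` loop end cleanly once the upstream stage is done.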
The System.Collections.Concurrent.BlockingCollection<T> is a very effective queue for this kind of thing.
The nice thing here is that you can scale it if you want by adding multiple calculation threads or query/update threads, provided the SQL Server can handle more concurrent transactions.
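Scaling a stage can be sketched as several workers draining the same queue. The example below is an assumption-laden illustration: `Calculate` and the `workerCount` parameter are hypothetical, and only the calculation stage is shown. The one subtlety is that the downstream queue must be closed only after every worker has finished.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class ScaledStage
{
    // Hypothetical calculation standing in for step 4.
    static int Calculate(int queryResult) => queryResult * queryResult;

    public static int[] Run(int[] queryResults, int workerCount)
    {
        var resultsQueue = new BlockingCollection<int>();
        var updateQueue = new BlockingCollection<int>();

        // Several workers drain the same queue; each item goes to exactly one of them.
        Task[] workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                foreach (int r in resultsQueue.GetConsumingEnumerable())
                    updateQueue.Add(Calculate(r));
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        foreach (int r in queryResults)
            resultsQueue.Add(r);
        resultsQueue.CompleteAdding();

        // Close the downstream queue only after every worker has finished.
        Task.WaitAll(workers);
        updateQueue.CompleteAdding();

        // With more than one worker, output order is no longer guaranteed, so sort for display.
        return updateQueue.GetConsumingEnumerable().OrderBy(x => x).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run(new[] { 1, 2, 3, 4 }, workerCount: 2)));
    }
}
```

The same pattern would apply to the query or update stages, where the worker count directly sets how many concurrent transactions hit the server.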
I use something very similar to this in a daily merge/update program, with very good results. That particular process uses standard file I/O rather than SQL Server, but the concepts translate very well.