I have a SQL Server 2008 database with 2 tables:
- Table A has columns ID (int), XmlDocument (xml)
- Table B has columns ID (int), PdfDocument (varbinary)
I have some .NET code that can take the XmlDocument and convert it into a PDF. I have 1.3 million rows in Table A, and converting all the rows sequentially would take 1.3 million rows at 1 row/sec, or about 15 days.
I want an approach that lets me do this in less than 2 hours. The problem seems to be a perfect case for parallelization. My question is what I should use to achieve this, and whether anyone has good advice that has worked in the past. I have access to a virtual machine lab and can potentially spin up several (5-6) virtual machines, and this is a test database that I can copy wherever I need it.
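To make the numbers concrete, here is a rough back-of-the-envelope calculation in Python. It assumes the 1 row/sec rate holds per worker and that throughput scales perfectly with the number of workers, which is optimistic; the variable names and the choice of 6 VMs are only for illustration.

```python
import math

ROWS = 1_300_000             # rows in Table A
SECONDS_PER_ROW = 1.0        # measured sequential conversion rate
TARGET_SECONDS = 2 * 60 * 60 # the 2-hour goal
VMS = 6                      # assumption: upper end of the 5-6 VMs available

# Sequential run time in days.
sequential_days = ROWS * SECONDS_PER_ROW / 86_400

# Workers needed to hit the target, assuming perfect linear scaling.
workers_needed = math.ceil(ROWS * SECONDS_PER_ROW / TARGET_SECONDS)

# Concurrent conversions each VM would have to sustain.
workers_per_vm = math.ceil(workers_needed / VMS)

print(f"sequential: {sequential_days:.2f} days")
print(f"workers needed: {workers_needed}, per VM: {workers_per_vm}")
```

In other words, hitting 2 hours needs on the order of 180 conversions running at once, or roughly 30 per machine across 6 VMs, before accounting for any database or I/O overhead.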
For example, should I do this in SQL (Service Broker or a SQL job for parallelism, calling a CLR proc for the conversion) or in .NET (should I have multiple processes on multiple machines, or will multiple threads on the same machine get me pretty close)? What will the bottlenecks be? Any suggestions about what strategies I should use to share work between threads?
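One common way to share the work is to split the ID range into contiguous chunks and hand each chunk to a worker. The Python sketch below only illustrates that partitioning idea; it assumes the IDs are reasonably dense integers, and `convert_range` is a hypothetical stand-in for the real step that would read the XML, convert it, and write the PDF to Table B. The same split applies whether the workers are threads, processes, or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (start, end) ID ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1


def convert_range(id_range):
    """Hypothetical placeholder: a real worker would select the rows in
    this ID range, convert each XmlDocument, and insert the PDFs.
    Here it only reports how many IDs the range covers."""
    start, end = id_range
    return end - start + 1


def run(min_id, max_id, chunk_size, workers):
    """Distribute the ID ranges over a pool of workers and total the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(convert_range, id_ranges(min_id, max_id, chunk_size)))


if __name__ == "__main__":
    print(run(1, 1_300_000, 10_000, workers=8))
```

Because the ranges do not overlap, workers never contend for the same rows, and a failed chunk can be retried on its own.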
There are a host of different solutions that could solve this problem, but I will suggest something novel: use the cloud.
Assuming the true bottleneck is the computing power needed to convert the XML to PDF, getting access to an environment with virtually unlimited scale-out may prove the quickest way.