My company is moving to SQL Server 2008 R2. We have a table with tons of archive data. The majority of the queries that use this table filter on a DateTime value in the WHERE clause. For example:
Query 1
SELECT COUNT(*)
FROM TableA
WHERE
CreatedDate > '1/5/2010'
and CreatedDate < '6/20/2010'
I’m assuming that the table is partitioned by month on CreatedDate, that the partitions are spread across multiple drives, that we have 8 CPUs, and that there are 500 million records in the database, evenly distributed across dates from 1/1/2008 to 2/24/2011 (38 partitions). The data could also be partitioned into quarters of a year or other time spans, but let’s keep the assumption of months.
In this case I believe that all 8 CPUs would be utilized, and that only the 6 partitions covering the dates between 1/5/2010 and 6/20/2010 would be queried.
Now what if I ran the following query, with the same assumptions as above?
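The setup assumed above could be sketched roughly as follows. This is only an illustration, not the actual schema: the names pfCreatedDateMonthly and psCreatedDateMonthly and the Id column are hypothetical, the boundary list is shortened to the first half of 2010, and everything is placed on PRIMARY rather than on separate drives. Table partitioning in SQL Server 2008 R2 also requires an edition that supports it (Enterprise).

```sql
-- Sketch only: every name except TableA, CreatedDate and State is hypothetical.
-- Shortened boundary list; the full monthly scheme would have one boundary per month.
CREATE PARTITION FUNCTION pfCreatedDateMonthly (datetime)
AS RANGE RIGHT FOR VALUES
    ('20100101', '20100201', '20100301', '20100401',
     '20100501', '20100601', '20100701');
GO

-- All partitions on PRIMARY for simplicity; a real scheme could map
-- partitions to filegroups that sit on different drives.
CREATE PARTITION SCHEME psCreatedDateMonthly
AS PARTITION pfCreatedDateMonthly ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.TableA
(
    Id          bigint IDENTITY(1,1) NOT NULL,  -- assumed surrogate key
    CreatedDate datetime             NOT NULL,
    State       varchar(50)          NOT NULL
) ON psCreatedDateMonthly (CreatedDate);
GO

-- Which partitions hold the two ends of the date range in Query 1?
-- With the boundaries above, partition 1 is everything before 2010,
-- so January 2010 is partition 2 and June 2010 is partition 7.
SELECT $PARTITION.pfCreatedDateMonthly(CAST('20100105' AS datetime)) AS FirstPartition,
       $PARTITION.pfCreatedDateMonthly(CAST('20100620' AS datetime)) AS LastPartition;
```

Because the WHERE clause in Query 1 filters on the partitioning column, the optimizer can limit the scan to that range of partitions (six of them here). The actual execution plan shows how many partitions were really touched.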
Query 2
SELECT COUNT(*)
FROM TableA
WHERE State = 'Colorado'
Questions:
1. Will all partitions be queried? Yes
2. Will all 8 CPUs be used to execute the query? Yes
3. Will performance be better than querying a table that is not partitioned? Yes
4. Is there anything else I’m missing?
5. How would a partitioned index help?
I answered the first 3 questions above based on my limited knowledge of SQL Server 2008 partitioned tables and parallelism. If my answers are incorrect, can you explain why?
Resources:
- Video: Demo SQL Server 2008 Partitioned Table Parallelism (5 minutes long)
- MSDN: Partitioned Tables and Indexes
- MSDN: Designing Partitions to Manage Subsets of Data
- MSDN: Query Processing Enhancements on Partitioned Tables and Indexes
- MSDN: Word Doc: Partitioned Table and Index Strategies Using SQL Server 2008 white paper
BarDev
Partitioning can increase performance; I have seen it many times. Partitioning was developed for performance, and that is still its purpose, especially for inserts. Here is an example from the real world:
I have multiple tables on a SAN with one big ole honking disk as far as we can tell. The SAN administrators insist that the SAN knows all so will not optimize the distribution of data. How can a partition possibly help? Fact: it did and does.
We partitioned multiple tables using the same scheme (FileID%200) with 200 partitions, ALL on primary. What use would that be if the only reason to have a partitioning scheme is for “swapping”? None, but the purpose of partitioning is performance. You see, each of those partitions has its own paging scheme. I can write data to all of them at once and there is no possibility of a deadlock. The pages cannot be locked by another writer because each writing process has a unique ID that equates to a partition. 200 partitions increased performance 2000x (fact) and deadlocks dropped from 7500 per hour to 3-4 per day. This is for the simple reason that page lock escalation always occurs with large amounts of data in a high-volume OLTP system, and page locks are what cause deadlocks. Partitioning, even on the same volume and filegroup, places the partitioned data on different pages, and lock escalation has no effect since processes are not attempting to access the same pages.
The benefit is there, but not as great, for selecting data. Typically, though, the partitioning scheme would be developed with the purpose of the DB in mind. I am betting Remus developed his scheme with incremental loading (such as daily loads) rather than transactional processing in mind. Now if one were frequently selecting rows with locking (read committed), then deadlocks could result if processes attempted to access the same page simultaneously.
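A scheme like the FileID%200 one described above might look something like the following. This is a minimal sketch under assumptions, not the actual production code: the table dbo.FileRows, the Payload column, and the names pfFileIdMod200 and psFileIdMod200 are hypothetical, and FileID is assumed to be non-negative. The boundary list is built dynamically only to avoid typing 199 values.

```sql
-- Sketch only: all object names here are hypothetical.
-- Build the boundary list 0, 1, ..., 198. With RANGE LEFT, 199 boundaries
-- give 200 partitions, one per possible value of FileID % 200.
DECLARE @values nvarchar(max) = N'';
DECLARE @i int = 0;
WHILE @i < 199
BEGIN
    SET @values = @values
                + CASE WHEN @i > 0 THEN N', ' ELSE N'' END
                + CAST(@i AS nvarchar(10));
    SET @i = @i + 1;
END;

EXEC (N'CREATE PARTITION FUNCTION pfFileIdMod200 (int)
        AS RANGE LEFT FOR VALUES (' + @values + N');');
GO

-- Every partition on the same filegroup, as described above.
CREATE PARTITION SCHEME psFileIdMod200
AS PARTITION pfFileIdMod200 ALL TO ([PRIMARY]);
GO

-- A computed column used as the partitioning column must be PERSISTED.
CREATE TABLE dbo.FileRows
(
    FileID       int          NOT NULL,   -- assumed non-negative
    Payload      varchar(100) NULL,       -- placeholder for the real columns
    PartitionKey AS (FileID % 200) PERSISTED NOT NULL
) ON psFileIdMod200 (PartitionKey);
GO

-- Optional (SQL Server 2008 and later): let lock escalation stop at the
-- partition level instead of escalating to the whole table.
ALTER TABLE dbo.FileRows SET (LOCK_ESCALATION = AUTO);
```

With this layout, writers that each load a different FileID land in different partitions, which is the separation the numbers above are attributed to.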
But Remus is right–in your example I see no benefit, in fact there may be some overhead cost in finding the rows across different partitions.
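For Query 2, the index on State matters more than the partitioning. As a rough sketch (the index names are hypothetical, and dbo.TableA is assumed to be already partitioned on CreatedDate), an index on State can be created either aligned with the table's partition scheme or as a single non-partitioned structure:

```sql
-- Sketch only: index names are hypothetical.
-- Option 1: aligned index. With no ON clause, the index is placed on the
-- same partition scheme as the table, so it is split into one B-tree per
-- partition. A seek on State alone has to visit every partition.
CREATE NONCLUSTERED INDEX IX_TableA_State_Aligned
    ON dbo.TableA (State);

-- Option 2: non-aligned index. Naming a filegroup explicitly gives one
-- B-tree for the whole table, so a seek on State is a single seek.
-- The trade-off: a non-aligned index prevents partition switching.
CREATE NONCLUSTERED INDEX IX_TableA_State_Global
    ON dbo.TableA (State)
    ON [PRIMARY];

-- Row counts per partition of the base table (heap or clustered index),
-- useful for checking how evenly the data is spread.
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.TableA')
  AND p.index_id IN (0, 1)
ORDER BY p.partition_number;
```

Only one of the two indexes would normally be kept; comparing the actual execution plans of Query 2 under each option shows whether the per-partition seeks add measurable overhead on a given system.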