I have a database table in Production used to store the workflow of a given item; each record of the table essentially represents the status of an item on a specific date.
The oversimplified table structure is something like this:
Workflow table
|-------------|------------|---------|----------------|
| Category | ItemCode | Status | InsertDate |
|-------------|------------|---------|----------------|
| Cat1 | Foo1 | 01 | 2012-01-01 |
|-------------|------------|---------|----------------|
| Cat1 | Foo1 | 02 | 2012-03-02 |
|-------------|------------|---------|----------------|
| Cat1 | Foo1 | 03 | 2012-04-01 |
|-------------|------------|---------|----------------|
| Cat1 | Foo2 | 01 | 2012-04-06 |
|-------------|------------|---------|----------------|
| Cat1 | Foo2 | 02 | 2012-05-07 |
|-------------|------------|---------|----------------|
| Cat1 | Foo2 | 04 | 2012-05-09 |
|-------------|------------|---------|----------------|
| Cat2 | Foo3 | 01 | 2011-02-03 |
|-------------|------------|---------|----------------|
| ... | ... | .. |.... |
|-------------|------------|---------|----------------|
So, on 2012-01-01 the item Foo1 reached status 01; on 2012-04-01 it reached status 03, and so on.
The stored procedure PR_GetCategoryItemsInformation takes a given Category as input, reads the Workflow table and gives a result like this:
@Input: Cat1
Output:
|------------------|---------------|------------------|---------------------|
| Category | ItemCode | DateOfFirstRecord| StatusOfLatestRecord|
|------------------|---------------|------------------|---------------------|
| Cat1 | Foo1 | 2012-01-01 | 03 |
| Cat1 | Foo2 | 2012-04-06 | 04 |
Given a Category, the SP needs to get, for each ItemCode, the first row of the workflow to read the InsertDate and the last row of the workflow to get the current Status.
It boils down to an SP implementation that looks like this:
CREATE PROCEDURE dbo.PR_GetFooItemInformation
    @Category CHAR(3)
AS
BEGIN
    -- Stage the relevant workflow rows in a temp table
    CREATE TABLE #TabTemp (
        Category   CHAR(3),
        ItemCode   CHAR(3),
        Status     CHAR(2),
        InsertDate DATETIME
    )

    CREATE CLUSTERED INDEX XIE1TabTemp
        ON #TabTemp (...)
    CREATE NONCLUSTERED INDEX XIE2TabTemp
        ON #TabTemp (...)

    INSERT INTO #TabTemp
    SELECT
        Category,
        ItemCode,
        Status,
        InsertDate
    FROM Workflow
    WHERE (Some rules to cut down the number of rows)

    -- T1 is the first workflow row of each item, T2 the latest one
    SELECT
        T1.Category,
        Item.ItemCode,
        T1.InsertDate,
        T2.Status
    FROM
        Item
    INNER JOIN
        #TabTemp AS T1 ON Item.ItemCode = T1.ItemCode
    INNER JOIN
        #TabTemp AS T2 ON Item.ItemCode = T2.ItemCode
    WHERE
        ...
        AND T1.InsertDate = (SELECT MIN(InsertDate)
                             FROM #TabTemp AS T3
                             WHERE ..)
        AND T2.InsertDate = (SELECT MAX(InsertDate)
                             FROM #TabTemp AS T4
                             WHERE ..)
END
The SP has worked as expected for many years (since 2005), but a couple of months ago it started to time out at random; since the number of records in the workflow table keeps growing (2.5M and counting), its performance will surely get worse and worse *.
The tables are properly indexed and, for what it's worth, SQL Server Management Studio does not suggest any further indexes for the SP.
The same SP without the temporary table is something like 4x slower.
At the moment, the temp table is populated with an average of 1.5M rows on each call.
The problem, to my limited DBA knowledge, is related to the MIN and MAX functions that need to be calculated to reach the first and the last row for each item of a given category.
I have omitted several details on the workflow table and on the SP implementation but I hope that what I’ve described could be enough to get an idea of the problem.
Finally the question:
Do you know any SQL strategies, or even SQL Server proprietary solutions, to handle this kind of scenario?
What kind of restrictions do I have?
Well, the SP is used on a BackOffice function and should return all the live records and not a preprocessed subset.
* I'm not a DBA; one of the DBAs is currently studying this little monster in his dark laboratory.
The transformation that you suggest can be done by a relatively simple query:
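A minimal sketch of such a query, assuming the simplified Workflow columns from the question (Category, ItemCode, Status, InsertDate) and a SQL Server version that supports ROW_NUMBER() (2005 or later). The alias names w and seqnum are only illustrative:

```sql
-- Sketch only: table and column names come from the question's simplified
-- Workflow table; the aliases w and seqnum are made up for this example.
SELECT
    w.Category,
    w.ItemCode,
    MIN(w.InsertDate) AS DateOfFirstRecord,
    -- seqnum = 1 marks the most recent row of each item, so the CASE keeps
    -- only that row's Status and MAX() collapses it to a single value
    MAX(CASE WHEN w.seqnum = 1 THEN w.Status END) AS StatusOfLatestRecord
FROM (
    SELECT
        Category,
        ItemCode,
        Status,
        InsertDate,
        ROW_NUMBER() OVER (PARTITION BY Category, ItemCode
                           ORDER BY InsertDate DESC) AS seqnum
    FROM Workflow
) AS w
GROUP BY
    w.Category,
    w.ItemCode;
```

The point of this shape is that Workflow is read once and there are no self-joins or correlated MIN/MAX subqueries. One caveat: if two rows of the same item share the same InsertDate, ROW_NUMBER() picks one of them arbitrarily, so a tie-breaking column in the ORDER BY would be needed in that case.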
I realize that this is more complicated once you put in your conditions.
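As a rough illustration of where the conditions might go: filtering inside the subquery, before the rows are numbered, means ROW_NUMBER() only has to work over the rows of interest. The @Category variable and its CHAR(4) size are my assumptions (sized to fit the sample value 'Cat1'), and the other rules from the original procedure are not shown in the question, so they are left as a comment:

```sql
-- Hypothetical stand-in for the procedure's @Category argument.
DECLARE @Category CHAR(4);
SET @Category = 'Cat1';

SELECT
    w.Category,
    w.ItemCode,
    MIN(w.InsertDate) AS DateOfFirstRecord,
    MAX(CASE WHEN w.seqnum = 1 THEN w.Status END) AS StatusOfLatestRecord
FROM (
    SELECT
        Category,
        ItemCode,
        Status,
        InsertDate,
        ROW_NUMBER() OVER (PARTITION BY ItemCode
                           ORDER BY InsertDate DESC) AS seqnum
    FROM Workflow
    WHERE Category = @Category
    -- the remaining row-reducing rules from the original SP would go here
) AS w
GROUP BY
    w.Category,
    w.ItemCode;
```

With the seven sample rows from the question and @Category = 'Cat1', this should return (Cat1, Foo1, 2012-01-01, 03) and (Cat1, Foo2, 2012-04-06, 04), matching the expected output shown there.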
In general, I prefer to let the SQL optimizer choose the best way to execute a query, rather than using temporary tables. (Having said that, there have been some very unpleasant experiences where I did have to resort to multiple queries because the optimizer chose the wrong plan.)
I suggest that you try this and see if it fixes your performance problem.