I’m trying to find a more performant way to search a log table. The table logs all searches performed on a site, and a single search can contain multiple filters on a single criterion. For example, users can search for homes in multiple counties and with multiple property types in a single search. I need to be able to run a report to find how many users searched within a specific county (or counties) with a specific property type (or types). The searches are currently logged in the following tables:
Stores the dimension definitions for a search:
CREATE TABLE [dbo].[LogSearchDimensions](
[ID] [int] IDENTITY(1,1) NOT NULL,
[VarName] [nvarchar](50) NOT NULL,
[Label] [nvarchar](50) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[Created] [datetime] NOT NULL,
CONSTRAINT [PK_LogSearchDimensions] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Example data:
ID VarName Label
----------- -------------------------------------------------- --------------------------------------------------
3 City_ID City ID
5 County_ID County ID
7 PageNum Page Number
8 PriceLow Lowest Price
9 PriceHigh Highest Price
10 Region_ID Region ID
11 Site_ID Site ID
14 AcreLow Lowest Acreage
15 AcreHigh Highest Acreage
16 State_ID State ID
17 Style Style
18 SiteStateID Site State ID
19 Distance Distance
20 FIPS FIPS Code
Stores the primary search information, such as when a search was performed, and who performed it:
CREATE TABLE [dbo].[LogSearches](
[ID] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[RecordCount] [int] NOT NULL,
[PageNumber] [int] NOT NULL,
[IPAddress] [varchar](15) NOT NULL,
[Domain] [nvarchar](150) NOT NULL,
[ScriptName] [nvarchar](500) NOT NULL,
[QueryString] [varchar](max) NULL,
[Referer] [nvarchar](1024) NOT NULL,
[SearchString] [nvarchar](max) NOT NULL,
[UserAgent] [nvarchar](2048) NULL,
[Processed] [datetime] NOT NULL,
[Created] [datetime] NOT NULL,
[IntegerIP] [int] NULL,
CONSTRAINT [PK_LogSearches] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Stores the dimensions for each search. This could be a single record, or 50, depending on the search that was performed:
CREATE TABLE [dbo].[LogSearchesDimensions](
[ID] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[LogSearch_ID] [numeric](18, 0) NOT NULL,
[LogSearchDimension_ID] [int] NOT NULL,
[SearchValue] [bigint] NULL,
CONSTRAINT [PK_LogSearchesDimensions] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[LogSearchesDimensions] WITH CHECK ADD CONSTRAINT [FK_LogSearchesDimensions_LogSearchDimensions] FOREIGN KEY([LogSearchDimension_ID])
REFERENCES [dbo].[LogSearchDimensions] ([ID])
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[LogSearchesDimensions] CHECK CONSTRAINT [FK_LogSearchesDimensions_LogSearchDimensions]
GO
ALTER TABLE [dbo].[LogSearchesDimensions] WITH CHECK ADD CONSTRAINT [FK_LogSearchesDimensions_LogSearches] FOREIGN KEY([LogSearch_ID])
REFERENCES [dbo].[LogSearches] ([ID])
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[LogSearchesDimensions] CHECK CONSTRAINT [FK_LogSearchesDimensions_LogSearches]
GO
In the LogSearchesDimensions table, I could have multiple records for County_ID (LogSearchDimension_id 5) if the user searched for more than one county in a single search. Let’s assume the user searched in counties 5, 6, 7, 12, and 15. When I’m running a report, this single search would need to show up in the reports for all 5 counties that were searched. If I ran a report that combined counties 5 and 6, then it should only show once.
I know this is a lot of information, and probably still not enough, but I’m hoping someone who’s done similar could share some tips for making this type of filter work with some degree of speed.
I currently have a very complicated query, with all sorts of joins and having conditions to try and find searches with the proper number of matches, but it’s not performing well at all. I’ve attached a diagram of the tables to show the relationships.
(source: dansshorts.com)
I’m currently searching using the following methods:
CREATE VIEW [dbo].[vwLogSearchesCounty]
AS
SELECT dbo.LogSearches.ID, CAST(FLOOR(CAST(dbo.LogSearches.Created AS FLOAT)) AS DATETIME) AS Created, C.county, C.County_ID, S.state, S.state_ID
FROM dbo.LogSearches WITH (NOLOCK) INNER JOIN
dbo.LogSearchesDimensions AS D WITH (NOLOCK) ON dbo.LogSearches.ID = D.LogSearch_ID AND D.LogSearchDimension_ID = 5 INNER JOIN
propertyControlCenter.dbo.county AS C WITH (NOLOCK) ON C.County_ID = D.SearchValue INNER JOIN
propertyControlCenter.dbo.state AS S WITH (NOLOCK) ON C.state_ID = S.state_ID
GO
DECLARE @LowDate DATETIME, @HighDate DATETIME;
SET @LowDate = '2010-01-01' ;
SET @HighDate = '2010-02-01' ;
SELECT
CONVERT(varchar, Created, 107) AS displayDate
, County , County_ID
, reportCount
FROM (
SELECT
Created
, County_ID , County , reportCount
, DENSE_RANK() OVER(ORDER BY MaxRecords DESC, County) AS theRank
FROM (
SELECT
v.County_ID
, v.Created
, C.county + ' County, ' + S.State_Code AS County
, COUNT(DISTINCT v.ID) AS reportCount
, MAX(COUNT(DISTINCT v.ID)) OVER(PARTITION BY v.County_ID) AS MaxRecords
FROM
vwLogSearchesCounty v (NOLOCK)
INNER JOIN propertyControlCenter.dbo.county C (NOLOCK) ON
v.County_ID = C.County_ID
AND v.Created BETWEEN @LowDate AND @HighDate
AND c.State_ID = 48
INNER JOIN propertyControlCenter.dbo.state S (NOLOCK) ON C.state_ID = S.state_ID
INNER JOIN LogSearchesDimensions D (NOLOCK) ON v.ID = D.LogSearch_ID AND D.LogSearchDimension_ID IN (8, 9, 14, 15, 17)
WHERE
1 = 0
OR (
D.LogSearchDimension_ID = 17
AND D.SearchValue IN (2,5,6 )
)
GROUP BY
v.Created
, v.County_ID
, C.County
, S.State_Code
HAVING COUNT(v.ID) >= 3 ) d
GROUP BY
Created
, County
, County_ID
, reportCount
, MaxRecords )
ranking
WHERE theRank <= 5
ORDER BY theRank, County_ID, Created
Current record counts:
LogSearches: 8,970,000
LogSearchesDimensions: 37,630,000
The above query takes 24 seconds to run (too long for our purposes) and returns the following data:
displayDate County County_ID reportCount
------------------------------ ------------------------------ ----------- -----------
Jan 01, 2010 Bastrop County, TX 6218 49
Jan 02, 2010 Bastrop County, TX 6218 84
Jan 03, 2010 Bastrop County, TX 6218 76
Jan 04, 2010 Bastrop County, TX 6218 118
Jan 05, 2010 Bastrop County, TX 6218 92
Jan 06, 2010 Bastrop County, TX 6218 59
Jan 07, 2010 Bastrop County, TX 6218 45
Jan 08, 2010 Bastrop County, TX 6218 84
Jan 09, 2010 Bastrop County, TX 6218 71
Jan 10, 2010 Bastrop County, TX 6218 91
Jan 11, 2010 Bastrop County, TX 6218 67
Jan 12, 2010 Bastrop County, TX 6218 52
Jan 13, 2010 Bastrop County, TX 6218 76
Jan 14, 2010 Bastrop County, TX 6218 104
Jan 15, 2010 Bastrop County, TX 6218 69
Jan 16, 2010 Bastrop County, TX 6218 51
Jan 17, 2010 Bastrop County, TX 6218 105
Jan 18, 2010 Bastrop County, TX 6218 76
Jan 19, 2010 Bastrop County, TX 6218 72
Jan 20, 2010 Bastrop County, TX 6218 69
Jan 21, 2010 Bastrop County, TX 6218 32
Jan 22, 2010 Bastrop County, TX 6218 54
Jan 23, 2010 Bastrop County, TX 6218 60
Jan 24, 2010 Bastrop County, TX 6218 76
Jan 25, 2010 Bastrop County, TX 6218 95
Jan 26, 2010 Bastrop County, TX 6218 73
Jan 27, 2010 Bastrop County, TX 6218 64
Jan 28, 2010 Bastrop County, TX 6218 57
Jan 29, 2010 Bastrop County, TX 6218 41
Jan 30, 2010 Bastrop County, TX 6218 87
Jan 31, 2010 Bastrop County, TX 6218 67
Feb 01, 2010 Bastrop County, TX 6218 66
Jan 01, 2010 Montgomery County, TX 6199 51
Jan 02, 2010 Montgomery County, TX 6199 70
Jan 03, 2010 Montgomery County, TX 6199 69
Jan 04, 2010 Montgomery County, TX 6199 74
Jan 05, 2010 Montgomery County, TX 6199 44
Jan 06, 2010 Montgomery County, TX 6199 60
Jan 07, 2010 Montgomery County, TX 6199 37
Jan 08, 2010 Montgomery County, TX 6199 39
Jan 09, 2010 Montgomery County, TX 6199 40
Jan 10, 2010 Montgomery County, TX 6199 71
Jan 11, 2010 Montgomery County, TX 6199 63
Jan 12, 2010 Montgomery County, TX 6199 54
Jan 13, 2010 Montgomery County, TX 6199 51
Jan 14, 2010 Montgomery County, TX 6199 46
Jan 15, 2010 Montgomery County, TX 6199 54
Jan 16, 2010 Montgomery County, TX 6199 45
Jan 17, 2010 Montgomery County, TX 6199 73
Jan 18, 2010 Montgomery County, TX 6199 70
Jan 19, 2010 Montgomery County, TX 6199 30
Jan 20, 2010 Montgomery County, TX 6199 57
Jan 21, 2010 Montgomery County, TX 6199 59
Jan 22, 2010 Montgomery County, TX 6199 43
Jan 23, 2010 Montgomery County, TX 6199 49
Jan 24, 2010 Montgomery County, TX 6199 72
Jan 25, 2010 Montgomery County, TX 6199 86
Jan 26, 2010 Montgomery County, TX 6199 43
Jan 27, 2010 Montgomery County, TX 6199 69
Jan 28, 2010 Montgomery County, TX 6199 57
Jan 29, 2010 Montgomery County, TX 6199 46
Jan 30, 2010 Montgomery County, TX 6199 52
Jan 31, 2010 Montgomery County, TX 6199 107
Feb 01, 2010 Montgomery County, TX 6199 40
Jan 01, 2010 Fayette County, TX 6240 58
Jan 02, 2010 Fayette County, TX 6240 65
Jan 03, 2010 Fayette County, TX 6240 50
Jan 04, 2010 Fayette County, TX 6240 61
Jan 05, 2010 Fayette County, TX 6240 52
Jan 06, 2010 Fayette County, TX 6240 48
Jan 07, 2010 Fayette County, TX 6240 44
Jan 08, 2010 Fayette County, TX 6240 40
Jan 09, 2010 Fayette County, TX 6240 25
Jan 10, 2010 Fayette County, TX 6240 56
Jan 11, 2010 Fayette County, TX 6240 51
Jan 12, 2010 Fayette County, TX 6240 47
Jan 13, 2010 Fayette County, TX 6240 43
Jan 14, 2010 Fayette County, TX 6240 47
Jan 15, 2010 Fayette County, TX 6240 43
Jan 16, 2010 Fayette County, TX 6240 37
Jan 17, 2010 Fayette County, TX
Preliminary Performance Analysis:
Principal Observation And Recommendation:
It appears that
Therefore, the best thing you could do would be to:
0. Re-implement it in a SQL Server OLAP Database:
OLAP databases (or Analysis Services or SSAS) are much better suited for this kind of application and much better optimized for this kind of query. However, on the assumption that this would be far beyond the scope of your current intentions, I will proceed with lower impact recommendations.
Preliminary Observations and Recommendations:
These are necessarily preliminary, as real performance analysis cannot be done on something this complex without at least an estimated query plan. In particular, although recommendations for possible improvements can be made, without the real data, or query plans based on the real data, it is impossible to tell which changes will actually make any real difference in performance.
The basic problem that I see is that this very complex query is doing a lot of full-table scans and, because of its complexity, is all but forcing these scans. Table scans can be very expensive, especially on large tables. However, if there is no effective way for SQL Server to pre-filter a table (that is, to eliminate 99% of the records via a test that does not require a join and can be applied to an index), then a scan is the most efficient choice that it has left. In that light, the following recommendations look at ways to remedy that situation:
A. Make a SARGable [Created] column on the [vwLogSearchesCounty] View.
Currently the [vwLogSearchesCounty] View defines its [Created] column like this:
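For reference, this is the expression as it appears in the view definition posted in the question. It converts the datetime to a float, drops the fractional (time-of-day) part, and converts back:

```sql
-- From the SELECT list of [dbo].[vwLogSearchesCounty] in the question:
-- strips the time portion so that [Created] holds midnight of the same day.
CAST(FLOOR(CAST(dbo.LogSearches.Created AS FLOAT)) AS DATETIME) AS Created
```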
The apparent reason for this is to allow the use of the BETWEEN operator on this column when searching for date ranges, without having to worry about the problems of time-of-day mismatches. However, this has the side effect of making this column non-SARGable, that is, not searchable via an index (“SARG” = “Search ARGument”). As the searches on this column are one of the few globally-fixed constraints (that is, something that can be searched on before doing joins), making it SARGable is one of your best opportunities for speeding up your queries.
To make it SARGable, you need to do the following:
Either remove the CAST(FLOOR(CAST(..))) functions from the [Created] column or add a second column to the view without them (I will assume the latter) like this:
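A minimal sketch of that second option, based on the view text in the question. The column name [CreatedTime] is the one referred to later in this answer; the table alias L is only there for readability and is an assumption:

```sql
-- Sketch only: the view from the question plus an unmodified copy of the column.
ALTER VIEW [dbo].[vwLogSearchesCounty]
AS
SELECT  L.ID,
        -- Date-only value, kept so existing grouping and display logic still works.
        CAST(FLOOR(CAST(L.Created AS FLOAT)) AS DATETIME) AS Created,
        -- Raw datetime with no function applied, so predicates on it stay SARGable.
        L.Created AS CreatedTime,
        C.county, C.County_ID, S.state, S.state_ID
FROM    dbo.LogSearches AS L WITH (NOLOCK)
        INNER JOIN dbo.LogSearchesDimensions AS D WITH (NOLOCK)
            ON L.ID = D.LogSearch_ID AND D.LogSearchDimension_ID = 5
        INNER JOIN propertyControlCenter.dbo.county AS C WITH (NOLOCK)
            ON C.County_ID = D.SearchValue
        INNER JOIN propertyControlCenter.dbo.state AS S WITH (NOLOCK)
            ON C.state_ID = S.state_ID;
```

Queries that only need the calendar day can keep using [Created]; only the range filter needs to move to [CreatedTime].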
Change the Date-Range search in the view, as below, from
to
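As a sketch of this change, assuming the view exposes the raw datetime as [CreatedTime] and that @LowDate and @HighDate are midnight values, as they are in the question. The first predicate is the one from the original query; the second is a half-open range intended to return the same rows:

```sql
-- Before: filters on the computed date-only column, which cannot use an index seek.
AND v.Created BETWEEN @LowDate AND @HighDate

-- After: half-open range on the raw column.
-- DATEADD moves the upper bound to the next midnight, so every search
-- made at any time on @HighDate's day is still included.
AND v.CreatedTime >= @LowDate
AND v.CreatedTime <  DATEADD(DAY, 1, @HighDate)
```

The half-open form (>= and <) avoids having to guess the last representable time of day, which is the usual trap with BETWEEN on datetime values.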
Make sure that there is an Index on the original [Created] column. A clustered index would be best:
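A hedged sketch of the index. The index name is made up. Note that, per the DDL in the question, [LogSearches] already has a clustered primary key on [ID], and a table can have only one clustered index, so the low-impact variant is a nonclustered index:

```sql
-- Low-impact variant: nonclustered index on the raw datetime column.
-- IX_LogSearches_Created is a hypothetical name.
CREATE NONCLUSTERED INDEX IX_LogSearches_Created
    ON dbo.LogSearches (Created);
```

Making [Created] the clustered index instead would mean dropping PK_LogSearches and recreating it as NONCLUSTERED, which in turn requires dropping and re-adding FK_LogSearchesDimensions_LogSearches. On a table of roughly 9 million rows that is a rebuild you would want to schedule and test first.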
(for more on correctly searching for dates and date ranges in SQL Server see here: http://www.sqlservercentral.com/Forums/Topic438717-338-1.aspx)
B. Make [LogSearchDimension_ID] column in the [LogSearchesDimensions] table Indexed.
I do not know if this column is already indexed, but it should be as it is another of the very few globally-fixed constraints. Again, a clustered index would be best:
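A possible shape for this index, with a made-up name. As with [LogSearches], the DDL in the question shows the table is already clustered on [ID], so this sketch uses a nonclustered index. Adding [SearchValue] to the key and including [LogSearch_ID] are my assumptions about what the report queries need, not something confirmed by a query plan:

```sql
-- Hypothetical nonclustered index for "dimension = X and value in (...)" filters.
-- The key columns support seeks such as LogSearchDimension_ID = 17 AND SearchValue IN (2, 5, 6);
-- the included column lets the join back to LogSearches avoid a lookup into the clustered index.
CREATE NONCLUSTERED INDEX IX_LogSearchesDimensions_Dimension_Value
    ON dbo.LogSearchesDimensions (LogSearchDimension_ID, SearchValue)
    INCLUDE (LogSearch_ID);
```

INCLUDE requires SQL Server 2005 or later, which the query in the question already implies through its use of DENSE_RANK() OVER and varchar(max).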
C. Make [State_code] column on the [State] table Indexed.
Like (B) above, it may not be indexed, but your query would probably benefit if it were.
D. Eliminate the explicit [County] and [State] tables from your query.
It appears that the only reason that you are explicitly joining the [County] and [State] tables in your query is to get the [State_Code] column from the [State] table that is not included in the [vwLogSearchesCounty] view. Ideally, the optimizer would figure out that it could merge these two references to the [State] and [County] tables together and you could pick up that column for free. Unfortunately the optimizer often misses such things on more complex queries, and on my server it appears to be doing double work for this field.
The best solution would be to just add the [State_code] column to the view and then eliminate the explicit [county] and [State] joins in your query. This additional column should not affect the performance of any other queries that use the view but do not use this column (SQL Server 2005 and above).
If you implement all four of these recommendations (including the CLUSTERED part) you should be able to eliminate the table scans. At this time I see no other reasonable way to do so.
More Desperate Recommendation:
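To illustrate, here is how the innermost SELECT of the query in the question might look once the view's SELECT list also returns S.State_Code. This is an untested sketch: the filters are the ones from the question, with the always-false "1 = 0 OR" wrapper folded away and the state filter moved onto the view's own [state_ID] column:

```sql
-- Assumes the view now also returns S.State_Code; no explicit county/state joins remain.
SELECT  v.County_ID,
        v.Created,
        v.county + ' County, ' + v.State_Code AS County,
        COUNT(DISTINCT v.ID) AS reportCount,
        MAX(COUNT(DISTINCT v.ID)) OVER (PARTITION BY v.County_ID) AS MaxRecords
FROM    vwLogSearchesCounty AS v
        INNER JOIN LogSearchesDimensions AS D WITH (NOLOCK)
            ON  v.ID = D.LogSearch_ID
            AND D.LogSearchDimension_ID IN (8, 9, 14, 15, 17)
WHERE   v.state_ID = 48                          -- was c.State_ID = 48 on the explicit join
  AND   v.Created BETWEEN @LowDate AND @HighDate
  AND   D.LogSearchDimension_ID = 17
  AND   D.SearchValue IN (2, 5, 6)
GROUP BY v.Created, v.County_ID, v.county, v.State_Code
HAVING  COUNT(v.ID) >= 3
```

The outer ranking query stays as it is; only this derived table changes.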
If all of this should fail to help you, I can only see one other relatively low-cost option, which is to turn your View [vwLogSearchesCounty], into an Indexed View. In particular, the Indexes that I mentioned above on [State_Code] and [Created]/[CreatedTime] should be reimplemented on the View (as well as any other important ones).
This has the effect, not only of indexing important fields, but also of pre-aggregating the View and automatically maintaining that pre-aggregation. This also means that it will take up storage space, so if this view has a very large number of rows, that may be an issue for you.
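A rough sketch of what an indexed view could look like here, with some caveats. A view must be created WITH SCHEMABINDING before it can be indexed, and a schema-bound view can only reference tables in its own database using two-part names, and cannot contain table hints such as NOLOCK. The existing view joins to propertyControlCenter.dbo.county and dbo.state, so it cannot be indexed as written. The sketch below therefore covers only the two local tables, and the county and state names would be joined in afterwards. The view, column, and index names are hypothetical:

```sql
-- Indexed views require certain session settings (for example ANSI_NULLS and
-- QUOTED_IDENTIFIER ON) when the view and its index are created.
CREATE VIEW dbo.vwLogSearchesCountyIndexed
WITH SCHEMABINDING
AS
SELECT  D.ID          AS LogSearchesDimension_ID,  -- unique per row; makes the index key unique
        L.ID          AS LogSearch_ID,
        L.Created     AS CreatedTime,              -- raw datetime; no float conversion in the view
        D.SearchValue AS County_ID
FROM    dbo.LogSearches AS L
        INNER JOIN dbo.LogSearchesDimensions AS D
            ON L.ID = D.LogSearch_ID
WHERE   D.LogSearchDimension_ID = 5;               -- county rows only, as in the original view
GO

-- The first index on a view must be UNIQUE CLUSTERED; this is what materializes it.
CREATE UNIQUE CLUSTERED INDEX IX_vwLogSearchesCountyIndexed
    ON dbo.vwLogSearchesCountyIndexed (CreatedTime, County_ID, LogSearchesDimension_ID);
```

Depending on your SQL Server edition, the optimizer may not use the view's index unless the query references the view with the WITH (NOEXPAND) hint, so check the actual plan before relying on it.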