I recently experienced a performance problem in Entity Framework querying against SQL Server 2008. I managed to fix the issue, but I don’t understand why my fix worked. I’m using a collection of Guids with a .Contains() method to generate an IN clause in SQL. Here’s the original code (table names changed to protect the innocent):
Guid[] values = filter.Split(',').Select<String, Guid>(d => new Guid(d)).ToArray();
returnValue = returnValue.Where(t => values.Contains(t.WorkItem.Requirement.Project.ProjectId));
This query takes ~20 seconds to execute when there are more than 150 ProjectIds. By moving the .Contains() to a different place in the query I can speed things up dramatically. Here’s the refactor:
Guid[] values = filter.FilterValue.Split(',').Select<String, Guid>(d => new Guid(d)).ToArray();
var projects = from p in context.DC_DEF_Project
               where values.Contains(p.ProjectId)
               select p;
returnValue = from t in returnValue
              join p in projects on t.DC_DEF_ProjectWorkItem.DC_DEF_ProjectRequirement.ProjectId equals p.ProjectId
              select t;
This code takes ~0.125 seconds on the same data set as the above query.
I’m sure there’s a sane reason for this, but my curiosity is killing me. What is it?
My guess would be that the first one results in SQL with a bunch of ORs evaluated against the foreign key on the work items after all the joins, whereas the second matches projects by their primary key, evaluates the 150 IDs only once, and then joins that result to the other tables.
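One way to check that guess rather than speculate is to dump the SQL each version generates and compare the two. Here is a minimal sketch, assuming the queries come from an ObjectContext (so they can be cast to ObjectQuery) and that the project references System.Data.Entity. The QueryDebug class and DumpSql method are hypothetical names made up for illustration, not part of Entity Framework.

```csharp
using System.Data.Objects;
using System.Linq;

// Hypothetical helper for illustration only; not part of Entity Framework.
static class QueryDebug
{
    // Returns the store command text for an ObjectContext-based query
    // without executing it. If the query is not an ObjectQuery, this
    // falls back to the query's own ToString().
    public static string DumpSql<T>(IQueryable<T> query)
    {
        var objectQuery = query as ObjectQuery;
        return objectQuery != null
            ? objectQuery.ToTraceString()
            : query.ToString();
    }
}
```

Calling QueryDebug.DumpSql(returnValue) after building each version of the query and pasting the output into Management Studio with the actual execution plan turned on should show where the 150-value predicate is applied in each case.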