Some time ago I asked the question "SQL Server: How do I maintain data integrity using aggregate functions with group by?" I got a great answer there, but now the problem has come up again, this time with LINQ to SQL rather than plain SQL.
Backstory: I have a table full of GPS data, which looks like this:
GPS_id, user_id, latitude, longitude, server_time, device_time
I used the following LINQ query to pull out the most recent GPS record for a certain set of users:
    var query =
        from gps in db.gps_data
        where (from u in db.users
               select u.user_id).Contains(gps.user_id)
        group gps by gps.user_id into groupedGPS
        select groupedGPS;
I then looped through it like so, but I had to order each group first in order to properly grab the "most recent record".
    foreach (var gpsItem in query) {
        var ordered = gpsItem.OrderByDescending(g => g.device_time);
        list.Add(ordered.First());
    }
This gave me what I needed, but at any given time I have 100+ users, each with 500+ GPS records (and all of them were being accessed this way), so this code was taking 10+ seconds, which I deemed unacceptable.
I then changed it to the following:
    var query =
        from gps in db.gps_data
        where (from u in db.users
               select u.user_id).Contains(gps.user_id)
        group gps by gps.user_id into groupedGPS
        select new
        {
            GPS_id = groupedGPS.Max(x => x.GPS_id),
            user_id = groupedGPS.Max(x => x.user_id),
            latitude = groupedGPS.Max(x => x.latitude),
            longitude = groupedGPS.Max(x => x.longitude),
            server_time = groupedGPS.Max(x => x.server_time),
            device_time = groupedGPS.Max(x => x.device_time)
        };
This query did seem faster because, as far as I understand it, the unnecessary data is never actually loaded into memory. However, as in my original question of several months ago, I've lost my data integrity this way. There's no guarantee that I'm seeing the most recent record, just the maximum value of each field within the grouping. This has no effect on most of the fields, but the latitude and longitude are almost always incorrect, as they are merely the max() values found in the grouping rather than the values from the most recent record.
How do I get around this issue? I realize the first solution retrieves the correct data, but it takes far too long.
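To make the integrity problem concrete, here is a minimal in-memory sketch (LINQ to Objects, not LINQ to SQL). The `GpsRow` class, the `MaxPerColumnDemo` helper, and the sample coordinates are all hypothetical and exist only for illustration; only the column names come from the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type mirroring some columns of the gps_data table.
public class GpsRow
{
    public int GPS_id { get; set; }
    public int user_id { get; set; }
    public double latitude { get; set; }
    public double longitude { get; set; }
    public DateTime device_time { get; set; }
}

public static class MaxPerColumnDemo
{
    // Two made-up readings for the same user; the second one is the most recent.
    public static List<GpsRow> SampleRows()
    {
        return new List<GpsRow>
        {
            new GpsRow { GPS_id = 1, user_id = 7, latitude = 51.50, longitude = -0.12,
                         device_time = new DateTime(2020, 1, 1, 9, 0, 0) },
            new GpsRow { GPS_id = 2, user_id = 7, latitude = 48.85, longitude = 2.35,
                         device_time = new DateTime(2020, 1, 1, 10, 0, 0) },
        };
    }

    // Takes the maximum of each column independently, as the grouped query does.
    public static GpsRow MaxPerColumn(IEnumerable<GpsRow> rows)
    {
        var list = rows.ToList();
        return new GpsRow
        {
            GPS_id = list.Max(r => r.GPS_id),
            user_id = list.Max(r => r.user_id),
            latitude = list.Max(r => r.latitude),       // comes from row 1
            longitude = list.Max(r => r.longitude),     // comes from row 2
            device_time = list.Max(r => r.device_time)  // comes from row 2
        };
    }
}
```

Calling `MaxPerColumnDemo.MaxPerColumn(MaxPerColumnDemo.SampleRows())` yields latitude 51.50 paired with longitude 2.35, a position that matches neither stored reading, because each `Max` is evaluated on its own column.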
Thanks for the help!
As far as I understand your question (newest record per user id), this would seem to do what you're looking for:
It may give you duplicates if there are multiple readings at the same time for a user; I'm assuming there aren't.
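One way a "newest record per user" query could be written is to keep only the rows whose device_time equals the maximum device_time for that user. The following is a hedged sketch of that idea, not necessarily the exact query intended above. It runs against an in-memory sequence so that it is self-contained; `GpsRow` and `LatestPerUser` are hypothetical names. Against LINQ to SQL, the same query shape over `db.gps_data` should translate to a correlated subquery that runs on the server, but that is an assumption worth checking against the generated SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type mirroring some columns of the gps_data table.
public class GpsRow
{
    public int GPS_id { get; set; }
    public int user_id { get; set; }
    public double latitude { get; set; }
    public double longitude { get; set; }
    public DateTime device_time { get; set; }
}

public static class LatestPerUser
{
    // Returns whole rows, so latitude and longitude always belong together.
    // If a user has several rows sharing the latest device_time, all are returned.
    public static List<GpsRow> Latest(IEnumerable<GpsRow> gpsData)
    {
        var rows = gpsData.ToList();
        var query =
            from gps in rows
            where gps.device_time ==
                  (from g in rows
                   where g.user_id == gps.user_id
                   select g.device_time).Max()
            select gps;
        return query.ToList();
    }
}
```

In memory this version rescans the list for every row, so it is only meant to show the query shape. If the tie case matters, one option is to group the result by user_id and take the row with the highest GPS_id from each group.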