In my program I wrote the following logic to read data from the database and store it in a List<UserInfo>:
NpgsqlCommand cmd = new NpgsqlCommand(query, conn);
List<UserInfo> result = new List<UserInfo>();
Npgsql.NpgsqlDataReader rdr = cmd.ExecuteReader();
while (rdr.Read())
{
    string userId = rdr[0].ToString();
    string sex = rdr[1].ToString();
    string strDateBirth = rdr[2].ToString();
    string zip = rdr[3].ToString();
    UserInfo userInfo = new UserInfo();
    userInfo.Msisdn = userId;
    userInfo.Gender = sex;
    try
    {
        userInfo.BirthDate = Convert.ToDateTime(strDateBirth);
    }
    catch (Exception ex)
    {
    }
    userInfo.ZipCode = zip;
    userInfo.DemographicsKnown = true;
    userInfo.AgeGroup = getAgeGroup(strDateBirth);
    if (result.Count(x => x.Id == userId) == 0)
        result.Add(userInfo);
}
The performance of this code is really poor. There are over 2 million records, and after half an hour the result list contains just 300,000 of them.
Does anyone know how to speed up data reading from the database?
You are using .Count() when you really mean .Any(). Whenever you call .Count() you are enumerating the entire collection just to see if you have a single match. Consider the question you’re asking:
“How many rows do you have that match this condition? Is that number equal to zero?”
What you really mean is:
“Do any rows match this condition?”
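To make the difference concrete, here is a small sketch that counts how often each operator invokes its predicate. The `Measure` helper and the sample ids are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountVersusAny
{
    // Returns how many times each LINQ operator invoked the predicate.
    public static (int CountCalls, int AnyCalls) Measure(List<string> ids, string userId)
    {
        int countCalls = 0, anyCalls = 0;

        // Count(predicate) visits every element, even after a match is found.
        _ = ids.Count(x => { countCalls++; return x == userId; });

        // Any(predicate) stops at the first match.
        _ = ids.Any(x => { anyCalls++; return x == userId; });

        return (countCalls, anyCalls);
    }

    public static void Main()
    {
        var ids = new List<string> { "100", "200", "300" };
        var calls = Measure(ids, "100");
        Console.WriteLine($"Count: {calls.CountCalls}, Any: {calls.AnyCalls}"); // Count: 3, Any: 1
    }
}
```

Note that when no element matches, `Any()` also has to walk the whole list, so on its own it only helps for ids that are already present.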
In that context, you could create a HashSet of the userId values. Checking for existence in a HashSet (or Dictionary) can be much faster than checking the same in a List, because it is a hash lookup rather than a scan of every element.
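As a rough illustration of the hash-based membership test, the sketch below keeps only the first occurrence of each id. The `KeepFirst` helper and the sample ids are assumptions standing in for values read from the database.

```csharp
using System;
using System.Collections.Generic;

public static class SeenIdsSketch
{
    // Returns the ids in input order with later duplicates dropped.
    public static List<string> KeepFirst(IEnumerable<string> incomingIds)
    {
        var seenIds = new HashSet<string>();
        var kept = new List<string>();

        foreach (string id in incomingIds)
        {
            // HashSet<T>.Add returns false when the value is already present,
            // so a single call both tests for and records the id.
            if (seenIds.Add(id))
                kept.Add(id);
        }
        return kept;
    }

    public static void Main()
    {
        var kept = KeepFirst(new[] { "100", "200", "100", "300", "200" });
        Console.WriteLine(string.Join(",", kept)); // prints "100,200,300"
    }
}
```

Using the return value of `Add` is one design option; calling `Contains` first and then `Add` behaves the same way with one extra lookup.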
Furthermore, if you do already have the userId, then you parsed and read all the values for no reason. Check for myHashset.Contains(userId) first, then add.
The repeated .Count() scan is the primary reason it’s slow. For n rows you’re performing roughly the nth triangular number of comparisons against the collection!
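To put a rough number on that, assume the worst case in which every row has a distinct userId, so the row at position k scans the k − 1 entries already in the list. With the roughly 2 million rows from the question:

```latex
\sum_{k=1}^{n} (k-1) = \frac{n(n-1)}{2},
\qquad
n = 2\times 10^{6} \;\Rightarrow\; \frac{n(n-1)}{2} \approx 2\times 10^{12}
```

That is on the order of two trillion predicate evaluations, against about two million hash lookups with a HashSet.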
EDIT: Consider this untested change. I don’t know if your reader supports typed read methods like GetString(), so if it doesn’t then simply use what you had before.
This will always keep the first instance of a user row you find (which is what your current code does). If you want to keep the last instance then you can use a dictionary and eliminate the .Contains() call altogether.
EDIT: I just noticed that my sample never added the userId to the hash… whoops… added it in there.
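A minimal sketch of the change described above (an illustration, not the original sample) might look like the following. It assumes the column order from the question's loop (id, sex, birth date, zip), uses a cut-down hypothetical `UserInfo`, and reads from an in-memory `DataTable` through the standard `IDataReader` interface so that it runs without a database. `KeepFirst`, `KeepLast`, `ReadUser` and `SampleRows` are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical stand-in for the question's UserInfo; only a few fields are shown.
public class UserInfo
{
    public string Msisdn { get; set; } = "";
    public string Gender { get; set; } = "";
    public DateTime? BirthDate { get; set; }
    public string ZipCode { get; set; } = "";
}

public static class DedupSketch
{
    // Column order (id, sex, birth date, zip) is assumed from the question's loop.
    static UserInfo ReadUser(IDataRecord row, string userId)
    {
        return new UserInfo
        {
            Msisdn = userId,
            Gender = row.GetString(1),
            // Typed read instead of ToString() followed by Convert.ToDateTime().
            BirthDate = row.IsDBNull(2) ? (DateTime?)null : row.GetDateTime(2),
            ZipCode = row.GetString(3),
        };
    }

    // Keeps the first row seen for each user id.
    public static List<UserInfo> KeepFirst(IDataReader rdr)
    {
        var result = new List<UserInfo>();
        var seenIds = new HashSet<string>();
        while (rdr.Read())
        {
            string userId = rdr.GetString(0);
            // Skip known ids before reading or converting any other column.
            if (seenIds.Contains(userId))
                continue;
            seenIds.Add(userId);
            result.Add(ReadUser(rdr, userId));
        }
        return result;
    }

    // Keeps the last row seen for each user id; the indexer overwrites,
    // so no Contains() check is needed.
    public static Dictionary<string, UserInfo> KeepLast(IDataReader rdr)
    {
        var byId = new Dictionary<string, UserInfo>();
        while (rdr.Read())
        {
            string userId = rdr.GetString(0);
            byId[userId] = ReadUser(rdr, userId);
        }
        return byId;
    }

    // In-memory rows standing in for the database table.
    public static DataTable SampleRows()
    {
        var table = new DataTable();
        table.Columns.Add("user_id", typeof(string));
        table.Columns.Add("sex", typeof(string));
        table.Columns.Add("birth_date", typeof(DateTime));
        table.Columns.Add("zip", typeof(string));
        table.Rows.Add("100", "F", new DateTime(1990, 1, 1), "11111");
        table.Rows.Add("200", "M", new DateTime(1985, 6, 15), "22222");
        table.Rows.Add("100", "F", new DateTime(1990, 1, 1), "33333");
        return table;
    }

    public static void Main()
    {
        using (IDataReader rdr = SampleRows().CreateDataReader())
            Console.WriteLine(KeepFirst(rdr)[0].ZipCode);    // first "100" row: 11111

        using (IDataReader rdr = SampleRows().CreateDataReader())
            Console.WriteLine(KeepLast(rdr)["100"].ZipCode); // last "100" row: 33333
    }
}
```

With Npgsql, the reader returned by `cmd.ExecuteReader()` could be passed to either method in place of the `DataTable` reader. The typed getters throw if a column's type does not match, so the indexer plus `ToString()` from the question remains the fallback when column types are uncertain.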