I have a script that imports CSV files. What ends up in my database is, among other things, a list of customers and a list of addresses. I have a table called customer and another called address, where address has a customer_id.
One thing that’s important to me is not to have any duplicate rows. Therefore, each time I import an address, I do something like this:
$address = new Address();
$address->setLine_1($line_1);
$address->setZip($zip);
$address->setCountry($usa);
$address->setCity($city);
$address->setState($state);
$address = Doctrine::getTable('Address')->findOrCreate($address);
$address->save();
What findOrCreate() does, as you can probably guess, is find a matching address record if it exists, otherwise just return a new Address object. Here is the code:
public function findOrCreate($address)
{
    $q = Doctrine_Query::create()
        ->select('a.*')
        ->from('Address a')
        ->where('a.line_1 = ?', $address->getLine_1())
        ->andWhere('a.line_2 = ?', $address->getLine_2())
        ->andWhere('a.country_id = ?', $address->getCountryId())
        ->andWhere('a.city = ?', $address->getCity())
        ->andWhere('a.state_id = ?', $address->getStateId())
        ->andWhere('a.zip = ?', $address->getZip());
    $existing_address = $q->fetchOne();
    if ($existing_address)
    {
        return $existing_address;
    }
    else
    {
        return $address;
    }
}
The problem is that this is slow. Saving each row of the CSV file (which translates into several INSERT statements on different tables) takes about a quarter of a second. I’d like to get that as close to “instantaneous” as possible, because I sometimes have over 50,000 rows in a file. If I comment out the part of my import that saves addresses, it’s much faster. Is there a faster way to do this? I briefly considered putting an index on the address table, but since all the fields need to match, it seems like an index wouldn’t help.
This certainly won’t alleviate all of the time spent on tens of thousands of iterations, but why don’t you manage your addresses outside of per-iteration DB queries? The general idea:
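Here is a minimal sketch of that idea in plain PHP, not a drop-in implementation. The names `addressKey()` and `AddressCache` are made up for illustration, and the database calls are left as stand-ins: `$existingRows` is assumed to come from a single SELECT run before the loop (with Doctrine 1.x, probably something like a `fetchArray()` on the address query), and `$insert` is whatever callback actually saves a new address and returns its id. The trimming and lowercasing in the key is also an assumption about what should count as a duplicate; adjust it to match your data.

```php
<?php
// Hypothetical helper: builds one lookup key from the fields that define a duplicate.
function addressKey(array $a): string
{
    $fields = ['line_1', 'line_2', 'city', 'state_id', 'zip', 'country_id'];
    $parts = [];
    foreach ($fields as $field) {
        // Assumption: missing or NULL fields are treated as '', and matching
        // ignores case and surrounding whitespace.
        $parts[] = strtolower(trim((string) ($a[$field] ?? '')));
    }
    // serialize() keeps field boundaries unambiguous before hashing.
    return md5(serialize($parts));
}

// Hypothetical in-memory index of known addresses: key => address id.
class AddressCache
{
    private $idsByKey = [];

    // $existingRows: the result of ONE SELECT executed before the import loop.
    public function __construct(array $existingRows)
    {
        foreach ($existingRows as $row) {
            $this->idsByKey[addressKey($row)] = $row['id'];
        }
    }

    // Returns the id of a matching address. $insert is only called (and so an
    // INSERT is only issued) when the address has not been seen before.
    public function findOrCreate(array $address, callable $insert)
    {
        $key = addressKey($address);
        if (!array_key_exists($key, $this->idsByKey)) {
            $this->idsByKey[$key] = $insert($address);
        }
        return $this->idsByKey[$key];
    }
}
```

Inside the import loop you would then call `$cache->findOrCreate($row, $insert)` for each CSV row and use the returned id for the customer's address. For around 50,000 addresses the array of hashes should fit in memory comfortably, but that is worth checking against your own data.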
Unless I’m misunderstanding the scenario, this way you’re only making INSERT queries if you have to, and you don’t need to perform any SELECT queries aside from the first one.