We’ve just built a system that rolls up its data at midnight. It must iterate through several combinations of tables in order to rollup the data it needs. Unfortunately the UPDATE queries are taking forever. We have 1/1000th of our forecasted userbase and it already takes 28 minutes to rollup our data daily with just our beta users.
Since the main lag is UPDATE queries, it may be hard to delegate servers to handle the data processing. What are some other options for optimizing millions of UPDATE queries? Is my scaling issue in the code below?:
$sql = "SELECT ab_id, persistence, count(*) as no_x FROM $query_table ftbl
WHERE ftbl.$query_col > '$date_before' AND ftbl.$query_col <= '$date_end'
GROUP BY ab_id, persistence";
$data_list = DatabaseManager::getResults($sql);
if (isset($data_list)){
foreach($data_list as $data){
$ab_id = $data['ab_id'];
$no_x = $data['no_x'];
$measure = $data['persistence'];
$sql = "SELECT ab_id FROM $rollup_table WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
if (DatabaseManager::getVar($sql)){
$sql = "UPDATE $rollup_table SET $rollup_col = $no_x WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
DatabaseManager::update($sql);
} else {
$sql = "INSERT INTO $rollup_table (ab_id, rollup_key, $rollup_col, rollup_date) VALUES ($ab_id, '$measure', $no_x, '$day_date')";
DatabaseManager::insert($sql);
}
}
}
When addressing SQL scaling issues, it is always best to benchmark your problematic SQL. Even at the PHP level is fine in this case, as you’re running your queries within PHP.
If your first query could potentially return millions of records, you may be better served running that query as a MySQL stored procedure. That will minimize the amount of data that has to be transferred between database server and PHP application server. Even if both are the same machine, you can still realize a significant performance improvement.
Some questions to consider that may help to resolve your issue follow:
If you’re unfamiliar with writing MySQL stored procedures, the process is quite simple. See http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx for an example. MySQL has good documentation on this as well. A stored procedure is a program that runs within the MySQL database process, which may help to improve performance when dealing with queries that potentially return millions of rows.
Set-based database operations are often faster than procedural operations. SQL is a set-based language. You can update all rows in a database table with a single UPDATE statement, i.e. UPDATE customers SET total_owing_to_us = 1000000 updates all rows in the customers table, without the need to create a programmatic loop like you’ve created in your sample code. If you have 100,000,000 customer entries, the set-based update will be significantly faster than the procedural update. There are lots of useful resources online that you can read up about this. Here’s a SO link to get started: Why are relational set-based queries better than cursors?.