I have a task where I need to parse an extremely big file and write the results into a MySQL database. “Extremely big” means about 1.4 GB of sort-of-CSV data, totalling approximately 10 million lines of text.
The question is not how to do it, but how to do it FAST. My first approach was to just do it in PHP without any speed optimization and let it run for a few days until it was done. Unfortunately, it has been running for 48 hours straight and has processed only 2% of the file, so that is not an option.
The file format is as follows:
A:1,2
where the number of comma-separated numbers following the “:” can be anywhere from 0 to 1000. The example line above has to go into a table as follows:
| A | 1 |
| A | 2 |
Right now, I do it like this:
$fh = fopen("file.txt", "r");
$line = ""; // buffer for the data
$i = 0; // line counter
$start = time(); // benchmark
// compare against false so a line that evaluates as falsy does not end the loop early
while (($line = fgets($fh)) !== false)
{
    $i++;
    echo "line " . $i . ": ";
    //echo $i . ": " . $line . "<br>\n";
    $line = explode(":", $line);
    if (count($line) != 2 || !is_numeric(trim($line[0])))
    {
        echo "error: source id [" . trim($line[0]) . "]<br>\n";
        continue;
    }
    $targets = explode(",", $line[1]);
    echo "node " . $line[0] . " has " . count($targets) . " links<br>\n";
    // insert links in link table
    foreach ($targets as $target)
    {
        if (!is_numeric(trim($target)))
        {
            echo "line " . $i . " has malformed target [" . trim($target) . "]<br>\n";
            continue;
        }
        $sql = "INSERT INTO link (source_id, target_id) VALUES ('" . trim($line[0]) . "', '" . trim($target) . "')";
        mysql_query($sql) or die("insert failed for SQL: ". mysql_error());
    }
}
echo "<br>\n--<br>\n<br>\nseconds wasted: " . (time() - $start);
This is obviously not optimized for speed in any way. Any hints for a fresh start? Should I switch to another language?
The first optimization would be to wrap the inserts in transactions: commit after every 100 or 1000 lines and begin a new transaction. You would have to use a storage engine that supports transactions, such as InnoDB.
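A minimal sketch of the batching idea, assuming a PDO connection instead of the old mysql_* functions used in the question. The function name importLinks and the $batchSize parameter are made up for illustration; only the link table and its two columns come from the question.

```php
<?php
// Sketch only: commit once per $batchSize source lines instead of once per INSERT.
// importLinks and $batchSize are hypothetical names, not part of the original script.
function importLinks(PDO $pdo, iterable $lines, int $batchSize = 1000): int
{
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // Prepare the statement once and reuse it for every row.
    $stmt = $pdo->prepare('INSERT INTO link (source_id, target_id) VALUES (?, ?)');

    $inserted = 0;      // rows written so far
    $linesInBatch = 0;  // valid source lines in the current transaction
    $pdo->beginTransaction();

    foreach ($lines as $line) {
        $parts = explode(':', $line);
        if (count($parts) !== 2 || !is_numeric(trim($parts[0]))) {
            continue; // malformed source id: skip the whole line
        }
        $source = trim($parts[0]);
        foreach (explode(',', $parts[1]) as $target) {
            $target = trim($target);
            if (!is_numeric($target)) {
                continue; // malformed or empty target: skip it
            }
            $stmt->execute([$source, $target]);
            $inserted++;
        }
        if (++$linesInBatch >= $batchSize) {
            // Close the current batch and open the next one.
            $pdo->commit();
            $pdo->beginTransaction();
            $linesInBatch = 0;
        }
    }
    $pdo->commit(); // flush the last, possibly partial, batch
    return $inserted;
}
```

The function accepts any iterable of lines, so an SplFileObject opened on the input file should work as the second argument. The best batch size depends on the server and is worth measuring rather than guessing.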
Then observe the CPU usage with the top command. If you have multiple cores, the mysql process is mostly idle, and the PHP process does most of the work, rewrite the script to accept a parameter that skips n lines from the beginning and imports only 10000 lines or so. Then start multiple instances of the script, each with a different starting point.
A third solution would be to convert the file into a CSV with PHP (no INSERT at all, just writing to a file) and then use LOAD DATA INFILE, as m4t1t0 suggested.
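The conversion step might look roughly like the sketch below. The function name convertToCsv, the $skip and $limit parameters, and the file paths are assumptions for illustration; $skip and $limit also show how a script could be told to handle only one slice of the file when several instances run in parallel.

```php
<?php
// Sketch only: rewrite "source:t1,t2,..." lines as one "source,target" row per link.
// convertToCsv, $skip and $limit are hypothetical names, not part of the original script.
function convertToCsv(string $inPath, string $outPath, int $skip = 0, ?int $limit = null): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('cannot open input or output file');
    }

    $lineNo = 0; // 1-based number of the current input line
    $rows = 0;   // CSV rows written
    while (($line = fgets($in)) !== false) {
        $lineNo++;
        if ($lineNo <= $skip) {
            continue; // this slice starts after the first $skip lines
        }
        if ($limit !== null && $lineNo > $skip + $limit) {
            break; // this slice covers at most $limit input lines
        }
        $parts = explode(':', $line);
        if (count($parts) !== 2 || !is_numeric(trim($parts[0]))) {
            continue; // malformed source id
        }
        $source = trim($parts[0]);
        foreach (explode(',', $parts[1]) as $target) {
            $target = trim($target);
            if (is_numeric($target)) {
                fwrite($out, $source . ',' . $target . "\n");
                $rows++;
            }
        }
    }
    fclose($in);
    fclose($out);
    return $rows;
}

// The resulting file could then be loaded with a statement along these lines
// (the path is a placeholder):
//   LOAD DATA INFILE '/path/to/links.csv' INTO TABLE link
//     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
//     (source_id, target_id);
```

Because the output contains only numbers and commas, no quoting or escaping is needed in the CSV. Each parallel instance would need its own output file so that the writes do not interleave.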