I have a large CSV file (around 700MB, about 4×10^6 rows) that I am trying to parse and insert into a MySQL database. I read the file line by line, parse each record, and insert the records into the database in batches of about 10k. A few transformations happen during parsing, e.g. converting a duration in the format 11d 12:34:56 into a number of hours using preg_match:
preg_match('/(?P<days>\d+)d (?P<hours>\d+):(?P<minutes>\d+):(?P<seconds>\d+)/', $hoursUsed, $matches);
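For reference, the conversion described above could be sketched as follows. This is a minimal illustration, not the asker's actual code: the function name durationToHours is hypothetical, and anchoring the pattern with ^ and $ is an added assumption so that malformed values are rejected instead of partially matched.

```php
<?php
declare(strict_types=1);

/**
 * Convert a duration such as "11d 12:34:56" into fractional hours.
 * Returns null when the input does not match the expected format.
 * (Hypothetical helper; the name and the anchors are assumptions.)
 */
function durationToHours(string $duration): ?float
{
    $pattern = '/^(?P<days>\d+)d (?P<hours>\d+):(?P<minutes>\d+):(?P<seconds>\d+)$/';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $duration, $m) !== 1) {
        return null;
    }

    // Days and hours are whole hours; minutes and seconds are fractions.
    return (int) $m['days'] * 24
        + (int) $m['hours']
        + (int) $m['minutes'] / 60
        + (int) $m['seconds'] / 3600;
}
```

With this sketch, durationToHours('11d 12:34:56') gives roughly 276.58 hours (11 × 24 + 12 + 34/60 + 56/3600).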
The script takes about 40 minutes to completely parse the file and insert all records into the database. The questions that I have here are:
* What should the expected time be? Is 40 minutes normal or not?
* Could the parsing of the csv file be
I am parsing a CSV file of around 700MB (around 4×10^6 rows) in PHP, but it takes around 40 minutes to parse the file. I have tried to optimize the parsing, but I was only able to bring it down from 45 to 40 minutes. My questions are:
- What should the expected time be? Is 40 minutes normal or not?
- I do this within the request, so there is no response until the file is completely parsed and everything is inserted. Is there a better way to delegate this to an asynchronous process?
FYI I am using CakePHP.
Using LOAD DATA INFILE would speed things up considerably. Just load the duration value into a CHAR field and let MySQL process it later. That way, you leave the data processing to the database, which will be significantly faster than PHP.
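A rough sketch of that approach through PDO is shown below. Everything schema-related here is an assumption for illustration: the table usage_import, the columns user_id, duration_raw and hours_used, the comma delimiter, and the header row skipped by IGNORE 1 LINES. It also assumes the LOCAL variant, which requires local_infile to be enabled on the server and PDO::MYSQL_ATTR_LOCAL_INFILE to be set on the connection.

```php
<?php
declare(strict_types=1);

// Sketch only: table and column names are hypothetical.
// Expected connection setup (not executed here):
//   new PDO($dsn, $user, $pass, [
//       PDO::MYSQL_ATTR_LOCAL_INFILE => true,
//       PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
//   ]);

/** Bulk-load the CSV, storing the duration text untouched in a CHAR column. */
function loadCsv(PDO $pdo, string $csvPath): int
{
    $sql = "LOAD DATA LOCAL INFILE " . $pdo->quote($csvPath) . "
            INTO TABLE usage_import
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
            (user_id, duration_raw)";

    // exec() returns the number of affected rows.
    return (int) $pdo->exec($sql);
}

/** Convert values like '11d 12:34:56' to fractional hours inside MySQL. */
function convertDurations(PDO $pdo): int
{
    // The part before 'd ' is the day count, the part after is a TIME string.
    $sql = "UPDATE usage_import
            SET hours_used =
                CAST(SUBSTRING_INDEX(duration_raw, 'd ', 1) AS UNSIGNED) * 24
                + TIME_TO_SEC(SUBSTRING_INDEX(duration_raw, 'd ', -1)) / 3600";

    return (int) $pdo->exec($sql);
}
```

Calling loadCsv() followed by convertDurations() keeps PHP out of the per-row work entirely; whether that beats the batched inserts on a given machine is something to measure rather than assume.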
Further, 40 minutes does not sound too bad for 700MB and 4 million records. Of course, it all depends on the code, the machine, etc.