I am using Doctrine and MongoDB for an application and there is one task that should import data from an CSV file into a collection. There are about 5 different CSV files with at least 450.000 entries per file that should be important 1 to 3 times per year.
Currently I iterate through all lines of a file, create an object, call persist() and flush every 2.500 items.
Each item is not very large, it has an ID, a string that is 10-20 characters, a string that is 5-10 characters and a boolean value.
My first question is:
When I flush every 5.000 items inserting gets significantly slower. In my test environment flushing 5.000 items took about 102 seconds, flushing 2.500 items took about 10 seconds.
Flushing gets slower after a while. As mentioned, at the beginning flush 2.500 items took about 10 seconds, after 100.000 items, flushing 2.500 items takes nearly one minute.
Are there any optimisations I can do to speed things up?
I think there are two parts that may be optimized in your script.
file_get_contents()orfile()for example), or reading it chunck by chunck withfopen(),fread().The last option is prefered because it will only take the necessary amount of memory when processing a bunch of lines.
clear()the object you already processed, else it will be kept in memory until the end of your script. This means that if one DC2 object uses 1Mo of memory, and you have 100,000 objects, at the end of your scripts it will use 100,000Mo.So batching your insert by range of 2,500 is quite a good idea but you obviously need to remove the processed object from the EntityManager.
It can be done using
$entityManager->clear();The
clear()will clear the wholeEntityManager, you want to clear a single entity, you may use$entityManager->detach($object).If you want to profile your memory usage you may also be interested in the functions
memory_get_usage()memory_get_peak_usage()