I have 2 big CSV files with millions of rows each.
Because both CSVs were exported from MySQL, I want to merge those 2 tables into one document in CouchDB.
What is the most efficient way to do this?
My current method is:
- import 1st CSV
- import 2nd CSV
To prevent duplication, the program searches for the document by key for each row. Once the document is found, it is updated with the columns from the 2nd CSV.
The problem is that searching for each row takes a really long time.
While importing the 2nd CSV, it updates 30 documents/second, and I have about 7 million rows. By a rough calculation, the whole import will take about 64 hours.
Thank You
Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.
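For each batch of documents:
- POST to /db/_all_docs?include_docs=true with a body that lists the keys in the batch, i.e. a JSON object of the form {"keys": [...]}.
- Build your _bulk_docs update depending on the results you get. Each row in the response either contains the existing document or reports that it was not found:
    {"key":"some_id_1", "doc": {"existing":"data"}}
    {"key":"some_id_2", "error":"not_found"}
- POST to /db/_bulk_docs with a body of the form {"docs": [...]} containing the merged documents.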
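The batch steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tested importer: the function names, the `db_url` parameter, the `new_columns` mapping (document ID to the columns from the 2nd CSV), and the batch size of 1000 are all assumptions made for illustration. It uses only the standard library and assumes no authentication is needed.

```python
import json
import urllib.request


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_docs(rows, new_columns):
    """Merge 2nd-CSV columns into the rows returned by _all_docs.

    rows: the "rows" list from the _all_docs response.
    new_columns: hypothetical dict mapping document ID -> columns to add.
    """
    docs = []
    for row in rows:
        key = row["key"]
        doc = row.get("doc")
        if doc is None:
            # No existing document for this key: start a new one.
            doc = {"_id": key}
        else:
            # Copy the existing document; it carries the _id and _rev
            # that CouchDB requires for an update.
            doc = dict(doc)
        doc.update(new_columns[key])
        docs.append(doc)
    return {"docs": docs}


def post_json(url, payload):
    """POST `payload` as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def import_second_csv(db_url, new_columns, batch_size=1000):
    """Two HTTP requests per batch instead of two per row."""
    for batch in chunked(list(new_columns), batch_size):
        found = post_json(db_url + "/_all_docs?include_docs=true",
                          {"keys": batch})
        body = build_bulk_docs(found["rows"], new_columns)
        results = post_json(db_url + "/_bulk_docs", body)
        # _bulk_docs reports one outcome per document; failed ones
        # (for example conflicts) would need to be retried.
        for result in results:
            if "error" in result:
                print("failed:", result.get("id"), result["error"])
```

Calling something like `import_second_csv("http://localhost:5984/db", new_columns)` would then process the 2nd CSV batch by batch; the URL here is only a placeholder. The right batch size depends on document size and available memory, so it is worth experimenting with.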