I am working on a research project in big data mining. I have written code that organizes my data into a dictionary. However, the amount of data is so large that my computer runs out of memory while the dictionary is being built. I need to periodically write the dictionary to disk, producing several partial dictionaries along the way. I then need to compare the resulting dictionaries, update the keys and values accordingly, and store the whole thing as one big dictionary on disk. Any idea how I can do this in Python? I need an API that can quickly write a dict to disk and then compare two dicts and update their keys. I can write the code to compare two dicts myself, that's not a problem, but I need to do it without running out of memory.
My dict looks like this:
"orange": ["It is a fruit", "It is very tasty", …]
I agree with Hoffman: go for a relational database. Data processing is a somewhat unusual task for a relational engine, but believe me, it is a good compromise between ease of use and deployment on the one hand and speed on large datasets on the other.
I customarily use sqlite3, which ships with Python, although more often I use it through apsw. The advantage of a relational engine like sqlite3 is that you can instruct it to do a lot of the processing through joins and updates, and it will take care of all the required swapping of data between memory and disk in quite a sensible manner. You can also use in-memory databases to hold small data that needs to interact with your big data, and link them to the main database through "ATTACH" statements. I have processed gigabytes this way.
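To make this concrete, here is a minimal sketch using only the standard-library sqlite3 module. The table name `facts`, the helper names, and the batch size are my own assumptions for illustration, not something from the question. Each dictionary entry becomes one row per (key, description) pair, so merging the partial dictionaries is handled by a UNIQUE constraint instead of by comparing dicts in memory. A small in-memory database is attached as `scratch` to hold a short list of keys that is then joined against the big on-disk table.

```python
import sqlite3


def open_store(path):
    """Open (or create) the on-disk store and attach an in-memory scratch DB."""
    conn = sqlite3.connect(path)
    # One row per (key, description) pair; the UNIQUE constraint means
    # re-inserting a pair that is already stored is silently skipped.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " key TEXT NOT NULL,"
        " description TEXT NOT NULL,"
        " UNIQUE (key, description))"
    )
    # Small helper data lives in memory but can be joined with the big table.
    conn.execute("ATTACH DATABASE ':memory:' AS scratch")
    conn.execute("CREATE TABLE scratch.wanted (key TEXT PRIMARY KEY)")
    return conn


def _flush(conn, batch):
    # 'with conn' commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO facts (key, description) VALUES (?, ?)",
            batch,
        )


def add_facts(conn, pairs, batch_size=10_000):
    """Stream (key, description) pairs to disk in fixed-size batches.

    Only one batch is held in memory at a time (batch_size is an
    illustrative default, tune it for your data).
    """
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) >= batch_size:
            _flush(conn, batch)
            batch = []
    if batch:
        _flush(conn, batch)


def descriptions_for(conn, key):
    """Return the stored descriptions for one key, in insertion order."""
    rows = conn.execute(
        "SELECT description FROM facts WHERE key = ? ORDER BY rowid", (key,)
    )
    return [row[0] for row in rows]


def count_wanted(conn, wanted_keys):
    """Count descriptions per key for a small set of keys held in memory."""
    with conn:
        conn.execute("DELETE FROM scratch.wanted")
        conn.executemany(
            "INSERT OR IGNORE INTO scratch.wanted (key) VALUES (?)",
            ((key,) for key in wanted_keys),
        )
    rows = conn.execute(
        "SELECT f.key, COUNT(*) FROM facts AS f"
        " JOIN scratch.wanted AS w ON w.key = f.key"
        " GROUP BY f.key ORDER BY f.key"
    ).fetchall()
    return dict(rows)
```

You would call `open_store` once with a file path, feed `add_facts` a generator that yields pairs as you parse your input, and look things up with `descriptions_for`. Because the pairs are streamed, the full dictionary never has to exist in memory at once.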