I do data mining research and often have Python scripts that load large datasets from SQLite databases, CSV files, pickle files, etc. During development I change my scripts frequently, and I find myself waiting 20 to 30 seconds for the data to load on every run.
Streaming the data (e.g. from a SQLite database) sometimes works, but not in all situations: if I need to go back into a dataset often, I’d rather pay the upfront time cost of loading it all.
My best solution so far is subsampling the data until I’m happy with my final script. Does anyone have a better solution/design practice?
My “ideal” solution would involve using the Python debugger (pdb) cleverly so that the data remains loaded in memory, I can edit my script, and then resume from a given point.
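One way to do this would be to keep your manipulation and loading scripts in separate files, `X.py` and `Y.py`, and have `X.py` begin with the lines that import `Y` and load the data.
When you’re coding `X.py`, you omit this part from the file and manually run it in an interactive shell, so the data stays in memory for the whole session. Then you can modify `X.py` and do an `import X` in the shell to test your code. Note that a repeated `import X` in the same session does not re-execute the module, so after the first import use `importlib.reload(X)` to pick up your edits.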
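Here is a rough sketch of that workflow, written as one runnable script so the load-once, reload-often pattern is visible end to end. The names `load`, `x_analysis`, and `run` are hypothetical stand-ins (for the loader in `Y.py` and the code in `X.py`), and the in-script file rewrite only simulates you editing the file in your editor. In a real session you would type the load and reload steps into the shell by hand.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc caching so a quick edit is never masked by stale bytecode.
sys.dont_write_bytecode = True

load_calls = 0

def load():
    """Hypothetical stand-in for the slow loader in Y.py."""
    global load_calls
    load_calls += 1
    return list(range(1000))  # placeholder for a real dataset

# Hypothetical stand-in for X.py: analysis code that receives the data
# as an argument instead of loading it itself.
workdir = Path(tempfile.mkdtemp())
module_path = workdir / "x_analysis.py"
module_path.write_text("def run(data):\n    return sum(data)\n")
sys.path.insert(0, str(workdir))

import x_analysis

data = load()                    # pay the loading cost once per session
first = x_analysis.run(data)     # first version of the analysis

# Simulate editing the file, then reload it without touching `data`.
module_path.write_text("def run(data):\n    return sum(data) / len(data)\n")
importlib.reload(x_analysis)
second = x_analysis.run(data)    # edited analysis, same in-memory data

print(first, second, load_calls)
```

One caveat with this approach: `importlib.reload` re-executes only the module you pass it, so names pulled in earlier with `from x_analysis import run` keep pointing at the old function until you import them again.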