I have about 100 CSV files, each with 100,000 rows and 40 columns. I’d like to do some statistical analysis on them: pull out some sample data, plot general trends, do variance and R-squared analysis, and plot some spectra diagrams. For now, I’m considering numpy for the analysis.
What issues should I expect with files this large? I’ve already checked for erroneous data. What are your recommendations for doing the statistical analysis? Would it be better to just split the files and do the whole thing in Excel?
I’ve found that Python + CSV is probably the fastest and simplest way to do some kinds of statistical processing.
We do a fair amount of reformatting and correcting for odd data errors, so Python helps us.
The availability of Python’s functional programming features makes this particularly simple. You can do sampling with tools like this.
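As a rough sketch of what such sampling tools could look like, the generator functions below take any iterable of rows and yield a subset. The names `every_nth` and `random_sample`, the column names, and the inline demo data are all assumptions for illustration; only the `csv` and `random` modules come from the standard library.

```python
import csv
import io
import random

def every_nth(rows, n):
    """Yield every n-th row (a systematic sample), starting with the first."""
    for i, row in enumerate(rows):
        if i % n == 0:
            yield row

def random_sample(rows, fraction, seed=None):
    """Yield each row independently with probability `fraction`."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < fraction:
            yield row

# Stand-in for one of the CSV files (hypothetical columns "t" and "value").
# With a real file, use open("some_file.csv", newline="") instead.
demo = io.StringIO("t,value\n0,1.0\n1,2.5\n2,4.0\n3,5.5\n4,7.0\n5,8.5\n")
reader = csv.DictReader(demo)

sample = list(every_nth(reader, 2))
print([row["t"] for row in sample])  # ['0', '2', '4']
```

Because the functions are generators, rows are read one at a time, so a 100,000-row file never has to be held in memory all at once.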
I really like being able to compose more complex functions from simpler functions.
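To illustrate that kind of composition, here is a minimal sketch that builds a "variance of one column over a filtered subset" function out of two smaller pieces. The helper names (`where`, `column`, `variance_of`), the column names, and the demo data are hypothetical; `statistics.variance` is the standard library's sample variance.

```python
import csv
import io
import statistics

def where(rows, predicate):
    """Yield only the rows for which predicate(row) is true."""
    for row in rows:
        if predicate(row):
            yield row

def column(rows, name):
    """Yield one column of each row, converted to float."""
    for row in rows:
        yield float(row[name])

def variance_of(rows, name, predicate=lambda row: True):
    """Sample variance of column `name` over the rows passing `predicate`."""
    return statistics.variance(list(column(where(rows, predicate), name)))

# Stand-in data with hypothetical columns "group" and "value".
demo = io.StringIO("group,value\na,1.0\nb,10.0\na,3.0\na,5.0\n")
reader = csv.DictReader(demo)

result = variance_of(reader, "value", lambda row: row["group"] == "a")
print(result)  # 4.0
```

Each piece stays small enough to test on its own, and a different filter or statistic can be swapped in without touching the code that reads the file.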