I have a text file that looks like this:
    Name, Test 1, Test 2, Test 3, Test 4, Test 5
    Bob, 86, 83, 86, 80, 23
    Alice, 38, 90, 100, 53, 32
    Jill, 49, 53, 63, 43, 23
I am writing a program that, given this text file, generates a Pearson's correlation coefficient table like the one below, where entry (x, y) is the correlation between person x and person y:
    Name, Bob, Alice, Jill
    Bob, 1, 0.567088412588577, 0.899798494392584
    Alice, 0.567088412588577, 1, 0.812425393004088
    Jill, 0.899798494392584, 0.812425393004088, 1
My program works, except that the data set I am actually feeding it has 82 columns and, more importantly, 54000 rows. When I run my program on that data set, it is incredibly slow and I get an out-of-memory error. Is there a way to, first of all, remove any possibility of an out-of-memory error, and maybe also make the program run a little more efficiently? The code is here: code.
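(For context, the naive approach amounts to something like the sketch below. This is simplified, not my exact code, and "scores.csv" / "correlations.csv" are placeholder file names.)

    import numpy as np

    names = []
    rows = []
    with open("scores.csv") as f:   # placeholder input file name
        f.readline()                # skip the header line
        for line in f:
            parts = line.strip().split(",")
            names.append(parts[0].strip())
            rows.append([float(x) for x in parts[1:]])

    data = np.array(rows)           # every row held in memory at once

    # Full n x n table: np.corrcoef correlates rows against rows by default,
    # so this allocates an n x n result on top of the input data.
    corr = np.corrcoef(data)

    with open("correlations.csv", "w") as out:
        out.write("Name," + ",".join(names) + "\n")
        for name, row in zip(names, corr):
            out.write(name + "," + ",".join(str(v) for v in row) + "\n")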
Thanks for your help,
Jack
Edit: In case anyone else is trying to do large-scale computation like this: convert your data into HDF5 format. This is what I ended up doing to solve the issue.
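The one-time conversion can look something like this (a minimal sketch assuming Python with h5py; "scores.csv", "scores.h5", and the dataset names are placeholders):

    import h5py
    import numpy as np

    N_ROWS, N_COLS = 54000, 82      # known shape of the data set

    with open("scores.csv") as f, h5py.File("scores.h5", "w") as h5:
        f.readline()                # skip the header
        scores = h5.create_dataset("scores", shape=(N_ROWS, N_COLS), dtype="f8")
        names = []
        for i, line in enumerate(f):
            parts = line.strip().split(",")
            names.append(parts[0].strip())
            # Writing row by row is slow, but this is a one-time conversion;
            # afterwards arbitrary row slices can be read without loading
            # the whole file.
            scores[i, :] = [float(x) for x in parts[1:]]
        h5.create_dataset("names", data=np.array(names, dtype="S"))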
You're going to have to do on the order of 54000^2 * 82 calculations and comparisons: roughly 54000^2 / 2 pairs of rows, each correlated over 82 columns. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be very large too. It will be slower, but it might use less memory if you keep the users in a database (or some other on-disk store) and correlate one user against all the others, then move on to the next user and do the same, instead of building one massive array or hash.
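Sketched in Python, and reading from the HDF5 file described in the question's edit instead of a database (the file and dataset names are the ones assumed in the sketch above), that one-against-all pattern might look like:

    import h5py
    import numpy as np

    CHUNK = 1000   # rows pulled off disk at a time; tune to available memory

    def corr_vs_block(x, block):
        # Pearson r between one row and every row of a block, via the mean
        # of z-score products (population standard deviations).
        xz = (x - x.mean()) / x.std()
        bz = (block - block.mean(axis=1, keepdims=True)) \
             / block.std(axis=1, keepdims=True)
        return bz @ xz / x.size

    with h5py.File("scores.h5", "r") as h5, \
         open("correlations.csv", "w") as out:
        data = h5["scores"]          # stays on disk, sliced on demand
        names = [n.decode() for n in h5["names"][:]]
        n = data.shape[0]
        out.write("Name," + ",".join(names) + "\n")
        for i in range(n):           # one user against all the others
            x = data[i, :]
            row = np.concatenate([corr_vs_block(x, data[s:s + CHUNK])
                                  for s in range(0, n, CHUNK)])
            out.write(names[i] + ","
                      + ",".join("%.15g" % v for v in row) + "\n")

It is much slower, since the whole data set is re-read once per user, but memory use stays bounded by the chunk size rather than growing with the number of rows.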