I am currently developing an application that processes several files, containing around 75,000 records a piece (stored in binary format). When this app is ran (manually, about once a month), about 1 million records are contained entirely with the files. Files are put in a folder, click process and it goes and stores this into a MySQL database (table_1)
The records contain information that needs to be compared to another table (table_2) containing over 700k records.
I have gone about this a few ways:
METHOD 1: Import Now, Process Later
In this method, I would import the data into the database without any processing from the other table. However when I wanted to run a report on the collected data, it would crash assuming memory leak (1 GB used in total before crash).
METHOD 2: Import Now, Use MySQL to Process
This was what I would like to do but in practice it didn’t seem to turn out so well. In this I would write the logic in finding the correlations between table_1 and table_2. However the MySQL result is massive and I couldn’t get a consistent output, sometimes causing MySQL giving up.
METHOD 3: Import Now, Process Now
I am currently trying this method and although the memory leak is subtle, It still only gets to about 200,000 records before crashing. I have tried numerous forced garbage collections along the way, destroying properly classes, etc. It seems something is fighting me.
I am at my wits end trying to solve the issue with memory leaking / the app crashing. I am no expert in Java and have yet to really deal with very large amounts of data in MySQL. Any guidance would be extremely helpful. I have put thought into these methods:
- Break each line process into individual class, hopefully expunging any memory usage on each line
- Some sort of stored routine where once a line is stored into the database, MySQL does the table_1 <=> table_2 computation and stores the result
But I would like to pose the question to the many skilled Stack Overflow members to learn properly how this should be handled.
I concur with the answers that say “use a profiler”.
But I’d just like to point out a couple of misconceptions in your question:
The storage leak is not due to massive data processing. It is due to a bug. The “massiveness” simply makes the symptoms more apparent.
Running the garbage collector won’t cure a storage leak. The JVM always runs a full garbage collection immediately before it decides to give up and throw an OOME.
It is difficult to give advice on what might actually be causing the storage leak without more information on what you are trying to do and how you are doing it.