I’m currently looking for a way to optimize a process that
takes a long time to run.
- There are about 270 text files to be filtered.
- Each file has about 70k~150k lines.
- The reference table usually has about 16 million records and lives in Oracle 10g.
- The process is run every hour.
- Up to 9 instances of the process may run almost
simultaneously.
What I currently do is spool the reference table into a file, copy that into
a hash, do the same with the text file, then do a hash key match up.
Any record in the text file that is also found in the reference list is discarded.
This gets repeated for all 270 files; the spooling part, however, is only done
once at the start.
However, this approach consumes about 300~500 MB of RAM, and with the
possibility of multiple instances of the process running at almost
the same time, it’s a nightmare for our server.
Any ideas how to do this better?
I’d suggest loading only the DB data into memory and processing the files like this (sorry, it’s pseudocode, but you should get the idea and be able to implement it in Perl):
This should take only about as much RAM as your reference table takes, since the text files themselves are never held in memory.
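For what it’s worth, here is a minimal Perl sketch of that idea, not a drop-in replacement for the existing script. It assumes the reference table has already been spooled to a file with one key per line, and that a whole line of each text file is the match key; adjust the key extraction if only part of the line should be compared. The sub names `load_reference` and `filter_file` and the `.filtered` output suffix are made up for illustration.

```perl
use strict;
use warnings;

# Load the spooled reference file into a hash.
# Assumption: one key per line, and the whole line is the key.
sub load_reference {
    my ($path) = @_;
    my %seen;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $key = <$fh>) {
        chomp $key;
        $seen{$key} = 1;    # only the existence of the key matters
    }
    close $fh;
    return \%seen;
}

# Stream one text file line by line and write only the lines
# whose key is NOT in the reference hash. Returns the number kept.
sub filter_file {
    my ($in_path, $out_path, $ref) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    my $kept = 0;
    while (my $line = <$in>) {
        chomp(my $key = $line);
        next if exists $ref->{$key};    # found in reference: discard
        print {$out} $line;
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $kept;
}

# Hypothetical driver: first argument is the spooled reference file,
# the remaining arguments are the text files to filter.
my ($ref_path, @files) = @ARGV;
if (defined $ref_path) {
    my $ref = load_reference($ref_path);    # done once for all files
    for my $file (@files) {
        my $kept = filter_file($file, "$file.filtered", $ref);
        print "$file: kept $kept lines\n";
    }
}
```

Saved as, say, `filter.pl`, it would be run as `perl filter.pl reference.txt file1.txt file2.txt`. The only large structure is the reference hash; each text file is read and written one line at a time, so its size doesn’t add to the memory footprint.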