I have a program that takes each item from one list and compares it against every item in another list. It’s been working fine so far, but the data is getting large and is going to exceed system memory.
I’m wondering what the best way is to compare two lists that are very large (maybe 5-10 GB each).
Here is a very simple example of what I’m doing (except the lists are huge and the values in the for loop are actually being processed/compared).
    public class CompareLists {
        public static void main(String[] args) {
            String[] listOne = {"a", "b", "c", "d", "e", "f",
                                "g", "h", "i", "j", "k", "l"};
            String[] listTwo = {"one", "two", "three", "four",
                                "five", "six", "seven"};

            // Pair every item of listOne with every item of listTwo.
            for (int listOneItem = 0; listOneItem < listOne.length; listOneItem++) {
                for (int listTwoItem = 0; listTwoItem < listTwo.length; listTwoItem++) {
                    System.out.println(listOne[listOneItem] + " " + listTwo[listTwoItem]);
                }
            }
        }
    }
I realize there has to be some disk I/O here since the data won’t fit in memory. My initial approach was to save both lists as files, load a batch of lines from listOne, stream the entire listTwo file against that batch, then load the next batch from listOne, and so on. Is there a better way? Or is there a Java way to access the lists like I’m doing above, with the data swapped to disk as needed?
You could put the Big Data in flat files and then stream one item of data in at a time from the files. This way only two items of data are in memory at any given time.
Obviously this isn’t going to win any efficiency awards, but here’s a simple example that uses text files containing one item per line:
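A minimal sketch along those lines, not a definitive implementation. It assumes two hypothetical text files, listA.txt and listB.txt, with one item per line; the class and method names (StreamingCompare, forEachPair) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.BiConsumer;

class StreamingCompare {

    // Calls `action` once for every (itemA, itemB) pair while holding only
    // one line from each file in memory at a time.
    static void forEachPair(Path fileA, Path fileB, BiConsumer<String, String> action)
            throws IOException {
        try (BufferedReader readerA = Files.newBufferedReader(fileA, StandardCharsets.UTF_8)) {
            String itemA;
            while ((itemA = readerA.readLine()) != null) {
                // The second file is reopened and read in full for every
                // item of the first file. This is the expensive part.
                try (BufferedReader readerB = Files.newBufferedReader(fileB, StandardCharsets.UTF_8)) {
                    String itemB;
                    while ((itemB = readerB.readLine()) != null) {
                        action.accept(itemA, itemB);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file names; pass real paths as arguments.
        Path fileA = Paths.get(args.length > 0 ? args[0] : "listA.txt");
        Path fileB = Paths.get(args.length > 1 ? args[1] : "listB.txt");
        forEachPair(fileA, fileB, (a, b) -> System.out.println(a + " " + b));
    }
}
```

The println stands in for whatever comparison you actually run on each pair.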
If the data you’re working with is too complicated to easily store one item per line in a text file, you could do a similar thing with ObjectInputStream and ObjectOutputStream, which can read and write one Java object at a time to a file.
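As a rough sketch of the object-stream variant: the Item class below is a hypothetical stand-in for your own data type, and the helper names are invented for illustration. Items are written and read back one object at a time, and the end of the file is detected by catching EOFException.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Consumer;

class ObjectStreamItems {

    // Hypothetical item type; replace with your own Serializable class.
    static class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final int value;

        Item(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    static void writeItems(Path file, Iterable<Item> items) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (Item item : items) {
                out.writeObject(item);
                // The stream remembers every object it has written; reset()
                // clears that table so memory use stays flat on huge files.
                out.reset();
            }
        }
    }

    // Reads the file one object at a time and hands each item to `action`.
    static void readItems(Path file, Consumer<Item> action)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                Item item;
                try {
                    item = (Item) in.readObject();
                } catch (EOFException endOfFile) {
                    break; // no more objects in the file
                }
                action.accept(item);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("items", ".bin");
        writeItems(file, Arrays.asList(new Item("a", 1), new Item("b", 2)));
        readItems(file, item -> System.out.println(item.key + " " + item.value));
        Files.delete(file);
    }
}
```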
If you can manage to fit listB in memory, then obviously you’d save quite a bit of disk access inside the first loop. Memoization might help you fit listB into memory if you have enough duplicate data.
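One way to read that suggestion, sketched under assumptions: keep only the distinct values of listB in memory, along with how often each occurs, and stream listA from disk once. The names CachedListB, loadDistinct and forEachPair are hypothetical, and this only pays off if listB really does contain many duplicates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class CachedListB {

    // Loads listB once, storing each distinct line a single time together
    // with the number of times it occurred in the file.
    static Map<String, Integer> loadDistinct(Path fileB) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(fileB, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.merge(line, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Streams listA from disk once and pairs every line with each distinct
    // value of listB held in memory.
    static void forEachPair(Path fileA, Map<String, Integer> distinctB,
                            BiConsumer<String, String> action) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(fileA, StandardCharsets.UTF_8)) {
            String itemA;
            while ((itemA = reader.readLine()) != null) {
                for (String itemB : distinctB.keySet()) {
                    action.accept(itemA, itemB);
                }
            }
        }
    }
}
```

Each distinct listB value is compared only once per listA item; if duplicates matter for your result, the stored count tells you how many times to weight that comparison.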
Also, the comparison of items is a textbook example of a problem that could be sped up by parallelization. E.g. hand the data comparison work off to worker threads so that the file-read thread can focus on maximizing throughput from the disk.
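A hedged sketch of that idea, assuming listB already fits in memory as a List<String> and that the comparison can be expressed as a predicate. ParallelCompare and countMatches are hypothetical names, and the queue size of 1024 is an arbitrary choice. The calling thread reads listA line by line and hands each line to a fixed pool of workers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiPredicate;

class ParallelCompare {

    // Counts the (itemA, itemB) pairs for which `matches` returns true.
    static long countMatches(Path fileA, List<String> listB,
                             BiPredicate<String, String> matches, int workers)
            throws IOException, InterruptedException {
        // Bounded queue: when it is full, CallerRunsPolicy makes the reading
        // thread run the task itself, so pending work cannot pile up in memory.
        ExecutorService pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1024),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicLong matchCount = new AtomicLong();
        try (BufferedReader readerA = Files.newBufferedReader(fileA, StandardCharsets.UTF_8)) {
            String line;
            while ((line = readerA.readLine()) != null) {
                final String itemA = line;
                // Each task compares one listA item against all of listB.
                pool.execute(() -> {
                    long local = 0;
                    for (String itemB : listB) {
                        if (matches.test(itemA, itemB)) {
                            local++;
                        }
                    }
                    matchCount.addAndGet(local);
                });
            }
        } finally {
            pool.shutdown(); // stop accepting tasks; queued tasks still run
        }
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return matchCount.get();
    }
}
```

For example, countMatches(fileA, listB, String::equals, 4) would count exact matches using four worker threads. The workers only read listB, so sharing it between threads is safe as long as nothing modifies it while the comparison runs.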