I have implemented a Java program. It is basically a multi-threaded service with a fixed number of threads. Each thread takes one task at a time and creates a HashSet; the size of a single HashSet can vary from 10 to 20,000+ items. At the end of each task, the result is added to a shared List collection inside a synchronized block.
The problem is that at some point I start getting an out-of-memory exception. After doing a bit of research, I found that this exception occurs when the GC is busy clearing memory, at which point it stops the whole world from executing anything.
Please give me suggestions for how to deal with such a large amount of data. Is HashSet the correct data structure to use? How should I deal with the memory exception? One way is to call System.gc(), which is not good either, as it will slow down the whole process. Or is it possible to dispose of the “HashSet hsN” after I add it to the shared List collection?
Please let me know your thoughts and guide me wherever I am going wrong. This service is going to deal with a huge amount of data processing.
Thanks
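import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Business object: holds the result of one task
public class Location {
    final int taskIndex;
    final HashSet<Integer> hsN;

    Location(int taskIndex, HashSet<Integer> hsN) {
        this.taskIndex = taskIndex;
        this.hsN = hsN;
    }
}

// Task to be performed by each worker thread
public class MyTask implements Runnable {
    private final int task;
    private final List<Location> locations;

    MyTask(int task, List<Location> locations) {
        this.task = task;
        this.locations = locations;
    }

    @Override
    public void run() {
        // GiveMeResult (defined elsewhere, not shown) returns a set of
        // integers whose size varies from 10 to 20,000+
        HashSet<Integer> hsN = GiveMeResult(task);
        // Lock on the shared list, since every task appends to it
        synchronized (locations) {
            locations.add(new Location(task, hsN));
        }
    }
}

public class Main {
    private static final int NTHREADS = 8;
    private static final List<Location> locations = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
        for (int i = 0; i < 216000; i++) {
            executor.execute(new MyTask(i, locations));
        }
        // Stop accepting new tasks; tasks already queued still run
        executor.shutdown();
        // Block until all tasks have finished instead of busy-waiting
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        System.out.println("Finished all threads");
    }
}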
For such an implementation, is Java the best choice, or would C# (.NET 4) be better?
A couple of issues that I can see:
You synchronize on the MyTask object, which is created separately for each execution. You should be synchronizing on a shared object, preferably the one that you are modifying, i.e. the locations object.
216,000 runs, multiplied by say 10,000 returned objects each, multiplied by a minimum of 12 bytes per Integer object, is about 24 GB of memory. Do you even have that much physical memory available on your computer, let alone available to the JVM?
32-bit JVMs have a heap size limit of less than 2 GB. On a 64-bit JVM, on the other hand, an Integer object takes about 16 bytes, which raises the memory requirements to over 30 GB.
With these numbers it’s hardly surprising that you get an OutOfMemoryError…
PS: If you do have that much physical memory available and you still think that you are doing the right thing, you might want to have a look at tuning the JVM heap size.
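The back-of-the-envelope estimate above can be reproduced with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, the 10,000 items per task is the assumed average from the estimate, and the 12 and 16 bytes per Integer are the rough figures quoted above, not measurements. It also prints the maximum heap the running JVM is allowed to use, for comparison.

```java
public class MemoryEstimate {
    // Rough per-object costs quoted in the answer (assumptions, not measured values).
    static final long BYTES_PER_INTEGER_32BIT = 12;
    static final long BYTES_PER_INTEGER_64BIT = 16;

    /** Bytes needed to hold tasks * itemsPerTask objects of the given size. */
    static long estimateBytes(long tasks, long itemsPerTask, long bytesPerItem) {
        return tasks * itemsPerTask * bytesPerItem;
    }

    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long tasks = 216_000;       // loop count from the question
        long itemsPerTask = 10_000; // assumed average result size

        long need32 = estimateBytes(tasks, itemsPerTask, BYTES_PER_INTEGER_32BIT);
        long need64 = estimateBytes(tasks, itemsPerTask, BYTES_PER_INTEGER_64BIT);
        // Upper bound on the heap this JVM will try to use (controlled by -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();

        System.out.printf("Integers at 12 bytes each: %.1f GiB%n", toGiB(need32));
        System.out.printf("Integers at 16 bytes each: %.1f GiB%n", toGiB(need64));
        System.out.printf("Max heap of this JVM:      %.1f GiB%n", toGiB(maxHeap));
    }
}
```

If the last number is far below the first two, the workload cannot fit in the heap no matter how the garbage collector behaves.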
EDIT:
Even with 25GB of memory available to the JVM it could still be pushing it:
Each Integer object requires 16 bytes on modern 64-bit JVMs.
You also need an 8-byte reference that will point to it, regardless of which List implementation you are using.
If you are using a linked list implementation, each entry will also have an overhead of at least 24 bytes for the list entry object.
At best you could hope to store about 1,000,000,000 Integer objects in 25 GB, or half that if you are using a linked list. That means that each task could not produce more than 5,000 (2,500 respectively) objects on average without causing an error.
I am unsure of your exact requirement, but have you considered returning a more compact object? For example, an int[] array produced from each HashSet would only keep the minimum of 4 bytes per result, without the object container overhead.
EDIT 2:
I just realized that you are storing the HashSet objects themselves in the list. HashSet objects use a HashMap internally, which in turn uses a HashMap.Entry object for each entry. On a 64-bit JVM the entry object consumes about 40 bytes of memory in addition to the stored object:
The key reference, which points to the Integer object: 8 bytes.
The value reference (in a HashSet it always points to the same shared placeholder object): 8 bytes.
The next entry reference: 8 bytes.
The hash value: 4 bytes.
The object overhead: 8 bytes.
Object padding: 4 bytes.
I.e. for each Integer object you need 56 bytes for storage in a HashSet. With the typical HashMap load factor of 0.75, you should add another 10 or so bytes for the HashMap array references. With 66 bytes per Integer you can only store about 400,000,000 such objects in 25 GB, without taking into account the rest of your application and any other overhead. That’s less than 2,000 objects per task…
EDIT 3:
You would be better off storing a sorted int[] array instead of a HashSet. That array is searchable in logarithmic time for any arbitrary integer and minimizes the memory consumption to 4 bytes per number. Considering the memory I/O, it would also be as fast as (or faster than) the HashSet implementation.
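A minimal sketch of that idea, assuming each task can still build its HashSet locally and only the compact form is kept in the shared list. The class and method names here are hypothetical, and the set is assumed to contain no null elements.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SortedIntSet {
    /** Copies the set into a primitive array and sorts it (4 bytes per value). */
    static int[] toSortedArray(Set<Integer> set) {
        int[] out = new int[set.size()];
        int i = 0;
        for (int value : set) { // unboxing; assumes the set holds no nulls
            out[i++] = value;
        }
        Arrays.sort(out);
        return out;
    }

    /** Membership test in O(log n) using binary search on the sorted array. */
    static boolean contains(int[] sorted, int value) {
        return Arrays.binarySearch(sorted, value) >= 0;
    }

    public static void main(String[] args) {
        Set<Integer> hsN = new HashSet<>(Arrays.asList(42, 7, 19, 3));
        int[] compact = toSortedArray(hsN);
        // After this point hsN can go out of scope and be garbage collected;
        // only the int[] needs to be stored in the shared list.
        System.out.println(Arrays.toString(compact));
        System.out.println(contains(compact, 19));
        System.out.println(contains(compact, 20));
    }
}
```

In the question's code, this would mean Location holds an int[] instead of a HashSet<Integer>, so the per-entry HashMap overhead is paid only temporarily inside each task.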