In Java, I use the following function to print the distinct elements of each column of a huge matrix X:
// create the list of distinct values
List<Integer> values = new ArrayList<Integer>();
// X is n * m int[][] matrix
for (int j = 0, x; j < m; j++) {
    values.clear();
    for (int i = 0; i < n; i++) {
        x = X[i][j];
        if (values.contains(x)) continue;
        System.out.println(x);
        values.add(x);
    }
}
The outer loop iterates over the columns (index j) and the inner loop over the rows (index i).
This function will be called millions of times for different matrices, so it needs to be optimized to meet the performance requirements. I’m wondering about the values list. Would it be faster to use values = new ArrayList<Integer>(); or values = null instead of values.clear()?
It would be much more efficient to use a Set instead of a List, for example the HashSet implementation. Its contains method runs in O(1) on average, instead of O(n) for a list. You can also save one call by calling only the add method, which returns false if the element was already in the set.
As for your specific question, I would just create a new Set at each loop. Object creation is not that expensive, probably less than clearing the set (as confirmed by the benchmark at the bottom; see the most efficient version in EDIT 2).
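The listing for this version is not shown here, so the following is only a minimal sketch of the idea: a fresh HashSet per column, with the return value of add replacing the separate contains call. The class name ColumnDistinctNewSet, the method forEachDistinct, and the IntConsumer parameter are hypothetical; the sink is there only so the output can be redirected, and passing System.out::println reproduces the printing from the question.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntConsumer;

final class ColumnDistinctNewSet {
    // Hypothetical helper: feeds each column's distinct values to sink,
    // in order of first appearance. X is assumed to be rectangular.
    static void forEachDistinct(int[][] X, IntConsumer sink) {
        int n = X.length;
        int m = n == 0 ? 0 : X[0].length;
        for (int j = 0; j < m; j++) {
            Set<Integer> values = new HashSet<>(); // new set for every column
            for (int i = 0; i < n; i++) {
                int x = X[i][j];
                // add returns true only if x was not already in the set
                if (values.add(x)) sink.accept(x);
            }
        }
    }

    public static void main(String[] args) {
        int[][] X = { {1, 2}, {1, 3}, {4, 2} };
        // prints 1, 4, 2, 3 on separate lines
        forEachDistinct(X, System.out::println);
    }
}
```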
However, the only way to know which is quicker (new object vs. clear) is to profile that portion of your code and check the performance of both versions.
EDIT
I ran a quick benchmark, and the clear version seems a little faster than creating a new set at each loop (by about 20%). You should still check which one is better on your own dataset and use case. The faster code with my dataset reuses a single set and clears it for each column.
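As a rough sketch of the clear variant (not the exact benchmarked listing, which is not reproduced here), the set can be allocated once outside the column loop and reset at the start of each column. The names ColumnDistinctClear and forEachDistinct and the IntConsumer sink are again hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntConsumer;

final class ColumnDistinctClear {
    // Hypothetical helper: same behavior as the question's loop,
    // but with one HashSet reused across all columns.
    static void forEachDistinct(int[][] X, IntConsumer sink) {
        int n = X.length;
        int m = n == 0 ? 0 : X[0].length;
        Set<Integer> values = new HashSet<>(); // allocated once
        for (int j = 0; j < m; j++) {
            values.clear(); // empty the set before each column
            for (int i = 0; i < n; i++) {
                int x = X[i][j];
                if (values.add(x)) sink.accept(x);
            }
        }
    }

    public static void main(String[] args) {
        int[][] X = { {5, 5}, {6, 5}, {5, 7} };
        // prints 5, 6, 5, 7 on separate lines
        forEachDistinct(X, System.out::println);
    }
}
```

Clearing keeps the capacity the set grew to on earlier columns, which is one plausible reason it can beat a default-sized new set, but that should be confirmed by measurement on your own data.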
EDIT 2
An even faster version of the code creates a new set of the right size at each loop.
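What "the right size" means is not spelled out above. One reasonable reading, stated here as an assumption, is an initial capacity of n (the number of rows) with a load factor of 1, so the set never has to resize while a column is scanned. The sketch below uses the standard HashSet(int initialCapacity, float loadFactor) constructor; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntConsumer;

final class ColumnDistinctPresized {
    // Hypothetical helper: a new, pre-sized HashSet for every column.
    static void forEachDistinct(int[][] X, IntConsumer sink) {
        int n = X.length;
        int m = n == 0 ? 0 : X[0].length;
        for (int j = 0; j < m; j++) {
            // Assumption: capacity n and load factor 1 leave room for
            // up to n distinct values without any rehashing.
            Set<Integer> values = new HashSet<>(n, 1.0f);
            for (int i = 0; i < n; i++) {
                int x = X[i][j];
                if (values.add(x)) sink.accept(x);
            }
        }
    }

    public static void main(String[] args) {
        int[][] X = { {1, 2}, {1, 3}, {4, 2} };
        // prints 1, 4, 2, 3 on separate lines
        forEachDistinct(X, System.out::println);
    }
}
```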
Summary of results
After JVM warm up + JIT: