I have a Map<byte[], Element> that I want to sort and write to disk, so that I end up with a file containing all the elements sorted by key according to Guava’s UnsignedBytes.lexicographicalComparator.
What I’m doing right now is:
HashMap<byte[], Element> memory;
// ... code creating and populating memory ...
// TreeMap ordered by unsigned lexicographic comparison of the byte[] keys
TreeMap<byte[], Element> sortedMap = new TreeMap<byte[], Element>(UnsignedBytes.lexicographicalComparator());
sortedMap.putAll(memory);
MyWriter writer = new MyWriter("myfile.dat");
for (Element element : sortedMap.values())
    writer.write(element);
writer.close();
It’s probably difficult to make the sorting itself faster than O(n log n); the question is whether I can speed up iterating over the sorted result. Ideally I’d sort into an ArrayList instead of a TreeMap, so that iterating through it would be very fast.
I thought about copying the HashMap’s contents into an ArrayList and calling Collections.sort() on it, but that would require more copying than the current solution.
Any ideas?
Edit:
Here is my test with an ArrayList, which is 2x faster, but I assume it uses more memory. Any comments on that assumption?
// ArrayList-based implementation 2x faster
ArrayList<Element> sorted = new ArrayList<Element>(memory.size());
sorted.addAll(memory.values());
final Comparator<byte[]> lexic = UnsignedBytes.lexicographicalComparator();
Collections.sort(sorted, new Comparator<Element>() {
    public int compare(Element arg0, Element arg1) {
        return lexic.compare(arg0.getKey(), arg1.getKey());
    }
});
MyWriter writer = new MyWriter(filename);
for (Element element : sorted)
    writer.write(element);
writer.close();
Your question was “Any ideas?”. I guess anything I could write would be an answer.
I had the same problem as you, and extensively benchmarked the two solutions: use a TreeMap so the items are kept sorted as they are inserted, or sort them after the fact. My benchmark showed the same result as yours: it’s faster to sort after the fact.
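For anyone who wants to repeat that kind of comparison, here is a minimal sketch (not the original benchmark). Everything in it is an assumption made for illustration: the class and method names are hypothetical, String values stand in for Element, and Arrays.compareUnsigned (Java 9+) stands in for Guava’s UnsignedBytes.lexicographicalComparator() so that it runs without Guava. The timing is crude and ignores JIT warm-up, so treat the numbers as a hint only.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortComparison {
    // Assumption: stand-in for Guava's UnsignedBytes.lexicographicalComparator() (needs Java 9+).
    static final Comparator<byte[]> LEXIC = (a, b) -> Arrays.compareUnsigned(a, b);

    /** Approach 1: copy everything into a TreeMap, then read the values back in key order. */
    static List<String> viaTreeMap(Map<byte[], String> memory) {
        TreeMap<byte[], String> sortedMap = new TreeMap<>(LEXIC);
        sortedMap.putAll(memory);
        return new ArrayList<>(sortedMap.values());
    }

    /** Approach 2: copy the entries into an ArrayList and sort that list by key afterwards. */
    static List<String> viaArrayList(Map<byte[], String> memory) {
        List<Map.Entry<byte[], String>> entries = new ArrayList<>(memory.entrySet());
        entries.sort(Map.Entry.comparingByKey(LEXIC));
        List<String> values = new ArrayList<>(entries.size());
        for (Map.Entry<byte[], String> entry : entries) {
            values.add(entry.getValue());
        }
        return values;
    }

    /** Builds n entries with distinct 8-byte keys in scrambled order (hypothetical test data). */
    static Map<byte[], String> sampleData(int n) {
        Map<byte[], String> memory = new HashMap<>();
        for (int i = 0; i < n; i++) {
            // Multiplying by an odd constant maps distinct i to distinct 64-bit values.
            long scrambled = i * 0x9E3779B97F4A7C15L;
            memory.put(ByteBuffer.allocate(8).putLong(scrambled).array(), "element-" + i);
        }
        return memory;
    }

    public static void main(String[] args) {
        Map<byte[], String> memory = sampleData(200_000);
        long t0 = System.nanoTime();
        List<String> fromTreeMap = viaTreeMap(memory);
        long t1 = System.nanoTime();
        List<String> fromArrayList = viaArrayList(memory);
        long t2 = System.nanoTime();
        System.out.println("TreeMap:   " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("ArrayList: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same order: " + fromTreeMap.equals(fromArrayList));
    }
}
```

Both methods should return the values in the same order, so the only thing being compared is how long each takes to get there.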
I wouldn’t be concerned about the fact that the second approach requires more copying. First, faster is faster, right? If the second approach takes fewer CPU cycles then it’s better.
If memory is a concern, then keep in mind that TreeMaps and HashMaps take far more memory per item than an ArrayList, which is backed by a simple object array. Each element in a TreeMap or HashMap requires at least one entry object, and usually more. Objects have a lot of overhead, 32 or more bytes. In a flat array each element is just a reference, which takes only 4 bytes (8 on a 64-bit JVM without compressed references).
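To put rough numbers on that, here is a back-of-the-envelope sketch using the figures from the paragraph above. The constants are assumptions, not measurements, and the real sizes depend on the JVM and its settings; the class and method names are made up for the example.

```java
final class MemoryEstimate {
    // Rough per-element figures taken from the text above (assumptions, not measured).
    static final long BYTES_PER_ENTRY_OBJECT = 32; // one TreeMap/HashMap entry object
    static final long BYTES_PER_ARRAY_SLOT = 4;    // one reference in an ArrayList's backing array

    /** Approximate bookkeeping cost of holding n elements as map entries. */
    static long entryOverhead(long n) {
        return n * BYTES_PER_ENTRY_OBJECT;
    }

    /** Approximate bookkeeping cost of holding n elements in a flat array. */
    static long arrayOverhead(long n) {
        return n * BYTES_PER_ARRAY_SLOT;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("map entries: " + entryOverhead(n) + " bytes");
        System.out.println("flat array:  " + arrayOverhead(n) + " bytes");
    }
}
```

With these figures, a million elements cost roughly 32 MB of entry objects versus about 4 MB of array slots, on top of the elements themselves, which are shared by both structures.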
My benchmarks showed that the time to allocate an array from memory was roughly proportional to the size of the array, once you got to an array size over a few dozen bytes. So allocating the ArrayList may be slow if it’s really large. Still, I think it’s the better bet, so long as there’s no danger of running out of memory.