I’m building a list of hashes that represent root-to-node paths in a tree. My functions work, but they are incredibly slow over large tree structures. Is there a better way? I’ve tried building the list in a single function, but then I get unique hashes where I don’t want them.
public ArrayList<Integer> makePathList(AbstractTree<String> tree) {
    StringBuilder buffer = new StringBuilder();
    ArrayList<Integer> pl = new ArrayList<Integer>();
    ArrayList<StringBuilder> paths = getPaths(tree, buffer);
    for (StringBuilder sb : paths) {
        pl.add(sb.toString().hashCode());
    }
    return pl;
}
public ArrayList<StringBuilder> getPaths(AbstractTree<String> tree, StringBuilder parent) {
    ArrayList<StringBuilder> list = new ArrayList<StringBuilder>();
    parent.append("/");
    parent.append(tree.getNodeName());
    list.add(new StringBuilder(parent));
    if (!tree.isLeaf()) {
        for (AbstractTree<String> child : tree.getChildren()) {
            list.addAll(getPaths(child, new StringBuilder(parent)));
        }
    }
    return list;
}
UPDATE:
Marcin’s suggestion to compute the hash during tree traversal gives the wrong answer, but perhaps that is down to the way I have implemented it?
public ArrayList<Integer> getPaths(AbstractTree<String> tree, StringBuilder parent) {
    ArrayList<Integer> list = new ArrayList<Integer>();
    parent.append("/");
    parent.append(tree.getNodeName());
    list.add(new StringBuilder(parent).toString().hashCode());
    if (!tree.isLeaf()) {
        for (AbstractTree<String> child : tree.getChildren()) {
            list.addAll(getPaths(child, new StringBuilder(parent)));
        }
    }
    return list;
}
I think your main problem is the amount of duplicate work you are doing: for every single leaf of the tree, you make a copy of the entire path leading up to that leaf and compute the hash over that whole path. That is, if you have 50,000 leaves under one top-level node, the name of that node is copied into 50,000 path strings and hashed 50,000 times.
If you could organize your data so that shared path prefixes are reused between leaves, and the hash calculations for those prefixes are computed once and reused, you could drastically reduce the actual amount of work to be done.
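For example, because Java’s String.hashCode() is defined by the recurrence h = 31*h + s[i] over the characters, you can extend a parent’s hash with just the new path segment instead of re-hashing the whole path from the root. Here is a minimal sketch of that idea, assuming AbstractTree exposes the same getNodeName(), isLeaf() and getChildren() methods used in your code (the helper name collectHashes is just illustrative):

public ArrayList<Integer> makePathList(AbstractTree<String> tree) {
    ArrayList<Integer> hashes = new ArrayList<Integer>();
    collectHashes(tree, 0, hashes);
    return hashes;
}

private void collectHashes(AbstractTree<String> node, int parentHash, ArrayList<Integer> hashes) {
    // Extend the parent's running hash with "/" + node name, one character
    // at a time, using the same recurrence String.hashCode() uses. This
    // yields exactly the hash of the full path string, but each node's
    // name is visited only once rather than once per descendant leaf.
    int h = 31 * parentHash + '/';
    String name = node.getNodeName();
    for (int i = 0; i < name.length(); i++) {
        h = 31 * h + name.charAt(i);
    }
    hashes.add(h);
    if (!node.isLeaf()) {
        for (AbstractTree<String> child : node.getChildren()) {
            collectHashes(child, h, hashes);
        }
    }
}

Since each prefix hash is carried down the recursion as a plain int, shared prefixes are hashed once each and no intermediate strings or StringBuilders are allocated at all, which should also explain why hashing during traversal can agree with your original two-pass version when the recurrence is applied this way.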