I’m having a problem with multi-threading in Java. I need to compare a large list of names to itself (to find near-duplicates).
I’ve split up the work into 4 different threads, each comparing 1/4 of the list to the complete list. I use the same class for all 4 threads.
When I look at the thread monitor, I see that they are not really running concurrently; they are active one after another.
What could be the problem?
This is the run method of my thread class:
@Override
public void run() {
    try {
        s = settings.conn.createStatement();
        JaroWinklerDistance jw = JaroWinklerDistance.JARO_WINKLER_DISTANCE;
        for (int i = 0; i < names.size(); i++) {
            for (int j = 0; j < allNames.size(); j++) {
                if (j % 250 == 0) {
                }
                double proximity = jw.proximity(names.get(i), allNames.get(j));
                if (proximity > Double.parseDouble(settings.properties.getProperty("distanceTreshold")) && proximity < 1.00) {
                    if (names.get(i).length() > allNames.get(j).length()) {
                        substituteName(allNames.get(j), names.get(i));
                        allNames.remove(allNames.get(j));
                    } else {
                        substituteName(names.get(i), allNames.get(j));
                        names.remove(names.get(i));
                        break;
                    }
                }
            }
        }
    } catch (SQLException ex) {
        Exceptions.printStackTrace(ex);
    }
}
The substituteName-method executes an SQL-query that updates the records.
The threads are created as follows:
settings.getAllNames();
int size = settings.allNames.size();
int rest = size % 4;
int groupSize = (size-rest) / 4;
GroupNormalizer a = new GroupNormalizer(settings.allNames, new ArrayList<String>(settings.allNames.subList(0, groupSize)));
GroupNormalizer b = new GroupNormalizer(settings.allNames, new ArrayList<String>(settings.allNames.subList(groupSize, (groupSize*2))));
GroupNormalizer c = new GroupNormalizer(settings.allNames, new ArrayList<String>(settings.allNames.subList((groupSize * 2), (groupSize * 3))));
GroupNormalizer d = new GroupNormalizer(settings.allNames, new ArrayList<String>(settings.allNames.subList((groupSize * 3), (groupSize*4 + rest))));
a.start();
b.start();
c.start();
d.start();
EDIT: All 4 threads alternate a lot between the running and monitor (blocked) states.
Hmm, it looks like the if-line that calls settings.properties.getProperty("distanceTreshold") is causing the synchronization lock-up.
Try to pull the Double.parseDouble call out of the loop, since everything in it looks constant to me.
It seems the settings object is blocking on access, and that is what slows you down.
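To illustrate the hoisting, here is a minimal sketch, assuming the threshold does not change while the threads run. As far as I know, on older JDKs Properties inherits synchronized accessors from Hashtable, so four threads calling getProperty inside an inner loop would contend for the same monitor. The class and method names (ThresholdHoisting, readThreshold, countMatches) are made up for illustration; only the "distanceTreshold" key comes from the question.

```java
import java.util.Properties;

class ThresholdHoisting {

    // Parses the threshold once. The key is spelled as in the question.
    static double readThreshold(Properties properties) {
        return Double.parseDouble(properties.getProperty("distanceTreshold"));
    }

    // Counts the scores that lie strictly between the threshold and 1.00,
    // mirroring the condition in the question's inner loop.
    static int countMatches(double[] scores, Properties properties) {
        // One lookup and one parse, done before the loop starts.
        final double threshold = readThreshold(properties);
        int matches = 0;
        for (double score : scores) {
            // The hot path now touches only a local primitive.
            if (score > threshold && score < 1.00) {
                matches++;
            }
        }
        return matches;
    }
}
```

In the run method from the question, the same idea would mean reading the threshold into a local double before the outer for loop and comparing proximity against that local.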
Also, it looks like you are accessing a DB during your calculation (you catch SQLException); this will slow you down by a very large factor. Try to separate the reads and writes from the calculation process.
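One possible way to separate the two phases is sketched below, under several assumptions of mine: the workers only read a private snapshot of the name list, they return the planned substitutions instead of touching the DB, and the similarity function is passed in as a parameter (in the question it would be jw.proximity). NearDuplicateFinder, Substitution, findSubstitutions and the tie-break for equal-length names are hypothetical and not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleBiFunction;

class NearDuplicateFinder {

    // A planned substitution: replace "from" with "to" (hypothetical helper type).
    static final class Substitution {
        final String from;
        final String to;
        Substitution(String from, String to) { this.from = from; this.to = to; }
    }

    // Assumed rule: the shorter name is replaced; ties are broken alphabetically.
    static boolean isReplacedBy(String name, String other) {
        if (name.length() != other.length()) {
            return name.length() < other.length();
        }
        return name.compareTo(other) < 0;
    }

    // Phase 1: pure computation, no DB access and no shared mutable state.
    static List<Substitution> findSubstitutions(List<String> allNames, int threadCount,
            double threshold, ToDoubleBiFunction<String, String> similarity) {
        final List<String> snapshot = new ArrayList<>(allNames); // workers only read this
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<List<Substitution>>> futures = new ArrayList<>();
            int chunk = (snapshot.size() + threadCount - 1) / threadCount; // ceiling division
            for (int start = 0; start < snapshot.size(); start += chunk) {
                final int begin = start;
                final int end = Math.min(start + chunk, snapshot.size());
                Callable<List<Substitution>> task = () -> {
                    List<Substitution> found = new ArrayList<>();
                    for (int i = begin; i < end; i++) {
                        String name = snapshot.get(i);
                        for (String other : snapshot) {
                            double p = similarity.applyAsDouble(name, other);
                            if (p > threshold && p < 1.00 && isReplacedBy(name, other)) {
                                found.add(new Substitution(name, other));
                                break; // first match wins, like the break in the question
                            }
                        }
                    }
                    return found;
                };
                futures.add(pool.submit(task));
            }
            // Collect the per-slice results in submission order.
            List<Substitution> all = new ArrayList<>();
            for (Future<List<Substitution>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("comparison failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Phase 2 would then run on a single thread after the call returns, for example by looping over the list and calling substituteName, or by using one PreparedStatement with addBatch and executeBatch. The sketch does not resolve chains (A replaced by B, then B replaced by C), so that would need handling before the writes.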