This is a parallel implementation of Levenshtein distance that I wrote for fun, and I’m disappointed in the results. I am running it on a Core i7 processor, so I have plenty of hardware threads available. However, performance degrades significantly as I increase the thread count: for input of the same size, it actually runs slower with more threads.
I was hoping that someone could look at the way I am using threads and the java.util.concurrent package and tell me if I am doing anything wrong. I’m really only interested in why the parallelism is not working as I would expect; I don’t expect the reader to work through the complicated indexing going on here. I believe the calculations are correct, but even if they are not, I think I should still be seeing a close to linear speed-up as I increase the number of threads in the thread pool.
The second code block below is the benchmarking code I used. It relies on external libraries (bb.util.Benchmark and Apache Commons Lang, as its imports show).
Thanks for any help :).
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class EditDistance {
    private static final int MIN_CHUNK_SIZE = 5;
    private final ExecutorService threadPool;
    private final int threadCount;
    private final String maxStr;
    private final String minStr;
    private final int maxLen;
    private final int minLen;

    public EditDistance(String s1, String s2, ExecutorService threadPool,
            int threadCount) {
        this.threadCount = threadCount;
        this.threadPool = threadPool;
        if (s1.length() < s2.length()) {
            minStr = s1;
            maxStr = s2;
        } else {
            minStr = s2;
            maxStr = s1;
        }
        maxLen = maxStr.length();
        minLen = minStr.length();
    }

    public int editDist() {
        int iterations = maxLen + minLen - 1;
        int[] prev = new int[0];
        int[] current = null;
        for (int i = 0; i < iterations; i++) {
            int currentLen;
            if (i < minLen) {
                currentLen = i + 1;
            } else if (i < maxLen) {
                currentLen = minLen;
            } else {
                currentLen = iterations - i;
            }
            current = new int[currentLen * 2 - 1];
            parallelize(prev, current, currentLen, i);
            prev = current;
        }
        return current[0];
    }

    private void parallelize(int[] prev, int[] current, int currentLen,
            int iteration) {
        int chunkSize = Math.max(current.length / threadCount, MIN_CHUNK_SIZE);
        List<Future<?>> futures = new ArrayList<Future<?>>(currentLen);
        for (int i = 0; i < currentLen; i += chunkSize) {
            int stopIdx = Math.min(currentLen, i + chunkSize);
            Runnable worker = new Worker(prev, current, currentLen, iteration,
                    i, stopIdx);
            futures.add(threadPool.submit(worker));
        }
        for (Future<?> future : futures) {
            try {
                Object result = future.get();
                if (result != null) {
                    throw new RuntimeException(result.toString());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                // We can only finish the computation if we complete
                // all subproblems
                throw new RuntimeException(e);
            }
        }
    }

    private void doChunk(int[] prev, int[] current, int currentLen,
            int iteration, int startIdx, int stopIdx) {
        int mergeStartIdx = (iteration < minLen) ? 0 : 2;
        for (int i = startIdx; i < stopIdx; i++) {
            // Edit distance
            int x;
            int y;
            int leftIdx;
            int downIdx;
            int diagonalIdx;
            if (iteration < minLen) {
                x = i;
                y = currentLen - i - 1;
                leftIdx = i * 2 - 2;
                downIdx = i * 2;
                diagonalIdx = i * 2 - 1;
            } else {
                x = i + iteration - minLen + 1;
                y = minLen - i - 1;
                leftIdx = i * 2;
                downIdx = i * 2 + 2;
                diagonalIdx = i * 2 + 1;
            }
            int left = 1 + ((leftIdx < 0) ? iteration + 1 : prev[leftIdx]);
            int down = 1 + ((downIdx < prev.length) ? prev[downIdx]
                    : iteration + 1);
            int diagonal = penalty(x, y)
                    + ((diagonalIdx < 0 || diagonalIdx >= prev.length) ? iteration
                            : prev[diagonalIdx]);
            int dist = Math.min(left, Math.min(down, diagonal));
            current[i * 2] = dist;
            // Merge prev
            int mergeIdx = i * 2 + 1;
            if (mergeIdx < current.length) {
                current[mergeIdx] = prev[mergeStartIdx + i * 2];
            }
        }
    }

    private int penalty(int maxIdx, int minIdx) {
        return (maxStr.charAt(maxIdx) == minStr.charAt(minIdx)) ? 0 : 1;
    }

    private class Worker implements Runnable {
        private final int[] prev;
        private final int[] current;
        private final int currentLen;
        private final int iteration;
        private final int startIdx;
        private final int stopIdx;

        Worker(int[] prev, int[] current, int currentLen, int iteration,
                int startIdx, int stopIdx) {
            this.prev = prev;
            this.current = current;
            this.currentLen = currentLen;
            this.iteration = iteration;
            this.startIdx = startIdx;
            this.stopIdx = stopIdx;
        }

        @Override
        public void run() {
            doChunk(prev, current, currentLen, iteration, startIdx, stopIdx);
        }
    }

    public static void main(String[] args) {
        int threadCount = 4;
        ExecutorService threadPool = Executors.newFixedThreadPool(threadCount);
        EditDistance ed = new EditDistance("Saturday", "Sunday", threadPool,
                threadCount);
        System.out.println(ed.editDist());
        threadPool.shutdown();
    }
}
There is a private inner class Worker inside EditDistance. Each worker is responsible for filling in a range of the current array using EditDistance.doChunk. EditDistance.parallelize is responsible for creating those workers, and waiting for them to finish their tasks.
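For readers who want the submit-then-wait pattern described above in isolation, here is a minimal sketch. It is not the code from the question: `ChunkedFill`, `fill`, and the "square each index" workload are hypothetical stand-ins, and it uses `ExecutorService.invokeAll` instead of submitting each task and collecting the futures by hand. It assumes Java 8 or later (for the lambda) and a positive `chunkSize`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of EditDistance.
class ChunkedFill {
    // Writes out[i] = i * i for every index, one task per chunk of chunkSize indices.
    static void fill(final int[] out, int chunkSize, ExecutorService pool) {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int start = 0; start < out.length; start += chunkSize) {
            final int from = start;
            final int to = Math.min(out.length, start + chunkSize);
            tasks.add(() -> {
                for (int i = from; i < to; i++) {
                    out[i] = i * i;
                }
                return null;
            });
        }
        try {
            // invokeAll blocks until every task has finished.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // surfaces any exception thrown inside a task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int[] out = new int[23];
            fill(out, 5, pool);
            System.out.println(out[out.length - 1]);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each call to `fill` pays for task creation, queueing, and a wait on every future, so the pattern only pays off when each chunk carries enough work to outweigh that fixed cost.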
And the code I am using for benchmarks:
import java.io.PrintStream;
import java.util.concurrent.*;
import org.apache.commons.lang3.RandomStringUtils;
import bb.util.Benchmark;

public class EditDistanceBenchmark {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Usage: <string length> <thread count>");
            System.exit(1);
        }
        PrintStream oldOut = System.out;
        System.setOut(System.err);
        int strLen = Integer.parseInt(args[0]);
        int threadCount = Integer.parseInt(args[1]);
        String s1 = RandomStringUtils.randomAlphabetic(strLen);
        String s2 = RandomStringUtils.randomAlphabetic(strLen);
        ExecutorService threadPool = Executors.newFixedThreadPool(threadCount);
        Benchmark b = new Benchmark(new Benchmarker(s1, s2, threadPool, threadCount));
        System.setOut(oldOut);
        System.out.println("threadCount: " + threadCount +
                " string length: " + strLen + "\n\n" + b);
        System.out.println("s1: " + s1 + "\ns2: " + s2);
        threadPool.shutdown();
    }

    private static class Benchmarker implements Runnable {
        private final String s1, s2;
        private final int threadCount;
        private final ExecutorService threadPool;

        private Benchmarker(String s1, String s2, ExecutorService threadPool, int threadCount) {
            this.s1 = s1;
            this.s2 = s2;
            this.threadPool = threadPool;
            this.threadCount = threadCount;
        }

        @Override
        public void run() {
            EditDistance d = new EditDistance(s1, s2, threadPool, threadCount);
            d.editDist();
        }
    }
}
It’s very easy to accidentally write code that does not parallelize well. A common culprit is threads competing for the same underlying hardware resource (e.g. a cache line). Since this algorithm has its threads writing to array elements that sit next to each other in memory, I strongly suspect that is what is happening here.
I suggest you review this excellent article on False Sharing
http://www.drdobbs.com/go-parallel/article/217500206?pgno=3
and then carefully review your code for cases where threads would block one another.
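To make false sharing concrete, here is a rough sketch, separate from the question's code. Two threads each increment their own counter, first in neighbouring array slots and then in slots 16 longs (128 bytes) apart. `FalseSharingDemo` and `timeIncrements` are hypothetical names, and the spacing assumes cache lines of 64 bytes, which is common but not guaranteed. Whether the timings differ depends on the hardware and JVM, so treat this as an experiment to run, not a proof.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical experiment for illustration; timings vary by machine and JVM.
class FalseSharingDemo {
    // Starts two threads; thread t increments slot t * stride `iterations` times.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long timeIncrements(final AtomicLongArray counters, final int stride,
            final long iterations) {
        Thread[] threads = new Thread[2];
        long start = System.nanoTime();
        for (int t = 0; t < threads.length; t++) {
            final int slot = t * stride;
            threads[t] = new Thread(() -> {
                for (long n = 0; n < iterations; n++) {
                    counters.incrementAndGet(slot);
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long iterations = 5_000_000L;
        // Slots 0 and 1: very likely on the same cache line.
        long adjacent = timeIncrements(new AtomicLongArray(2), 1, iterations);
        // Slots 0 and 16: 128 bytes apart, so likely on different cache lines.
        long spaced = timeIncrements(new AtomicLongArray(17), 16, iterations);
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("spaced slots:   " + spaced / 1_000_000 + " ms");
    }
}
```

The sketch uses `AtomicLongArray` so that every increment actually reaches memory; with a plain `long[]` the JIT compiler may keep the counter in a register and hide the effect.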
Additionally, running more threads than you have CPU cores will slow down performance if your threads are CPU bound (if you’re already using all cores to near 100%, adding more threads will only add overhead for the context switches).
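One way to guard against oversubscription is to cap the requested thread count at what the JVM reports as available. The sketch below is a suggestion, not something from the question: `PoolSizing` and `cappedThreadCount` are hypothetical names, while `Runtime.getRuntime().availableProcessors()` is the standard call. Note that it counts logical processors, so on a hyper-threaded CPU it may report twice the number of physical cores.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper for illustration.
class PoolSizing {
    // Returns the requested count, clamped to the range [1, available logical processors].
    static int cappedThreadCount(int requested) {
        int processors = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(requested, processors));
    }

    public static void main(String[] args) {
        int threadCount = cappedThreadCount(64);
        ExecutorService threadPool = Executors.newFixedThreadPool(threadCount);
        try {
            System.out.println("using " + threadCount + " threads");
        } finally {
            threadPool.shutdown();
        }
    }
}
```

In the benchmark above, the capped value could be passed both to `Executors.newFixedThreadPool` and to the `EditDistance` constructor so that chunking and pool size stay consistent.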