I am seeing some strange behavior in an application I am building.
Preface
The application has a simple goal: to take a collection of strings and search for each of those strings across multiple text files. It also tracks unique matches for each string, so the string "abcd" is counted only once for file A even if it appears n times in that file.
Since this application will mainly be dealing with large numbers of files and large numbers of strings, I decided to do the string search in the background by creating a class that implements Runnable and using an ExecutorService to run the Runnable task. I also decided to investigate the speed of the string search, so I started comparing the times of different string-matching methods (String.contains(), String.indexOf(), and the Boyer-Moore algorithm). I grabbed the source code of the Boyer-Moore algorithm from http://algs4.cs.princeton.edu/53substring/BoyerMoore.java.html and included it in my project. Here is where the problem started…
The Problem
I noticed that the string search came back with varying results when using the BoyerMoore class: each time I ran the search, the number of found strings was different. I replaced the Boyer-Moore search with String.contains(), so that the code looks like the following…
private boolean findStringInFile(String pattern, File file) {
    boolean result = false;
    BoyerMoore bm = new BoyerMoore(pattern); // This line still causes varying results.
    try {
        Scanner in = new Scanner(new FileReader(file));
        while(in.hasNextLine() && !result) {
            String line = in.nextLine();
            result = line.contains(pattern);
        }
        in.close();
    } catch (FileNotFoundException e) {
        System.out.println("ERROR: " + e.getMessage());
        System.exit(0);
    }
    return result;
}
Even with the above code, the results were still inconsistent. It seems like the instantiation of the BoyerMoore object is causing the results to vary. I dug a little deeper and found that the following code in the BoyerMoore constructor was causing this inconsistency…
// position of rightmost occurrence of c in the pattern
right = new int[R];
for (int c = 0; c < R; c++)
    right[c] = -1;
for (int j = 0; j < pat.length(); j++)
    right[pat.charAt(j)] = j;
Now I know what was causing the inconsistency, but I still do not understand why it was happening. I’m no veteran when it comes to multithreading, so any explanation or insight is greatly appreciated!
Below is the full code for the search task…
private class Search implements Runnable {
    private File mSearchableFile;
    private ConcurrentHashMap<String,Integer> mTable;

    public Search(File file,ConcurrentHashMap<String,Integer> table) {
        mSearchableFile = file;
        mTable = table;
    }

    @Override
    public void run() {
        Iterator<String> nodeItr = mTable.keySet().iterator();
        while(nodeItr.hasNext()) {
            String currentString = nodeItr.next();
            if(findStringInFile(currentString , mSearchableFile)) {
                Integer count = mTable.get(currentString) + 1;
                mTable.put(currentString,count);
            }
        }
    }

    private boolean findStringInFile(String pattern, File file) {
        boolean result = false;
        // BoyerMoore bm = new BoyerMoore(pattern);
        try {
            Scanner in = new Scanner(new FileReader(file));
            while(in.hasNextLine() && !result) {
                String line = in.nextLine();
                result = line.contains(pattern);
            }
            in.close();
        } catch (FileNotFoundException e) {
            System.out.println("ERROR: " + e.getMessage());
            System.exit(0);
        }
        return result;
    }
}
This should perform better as
This gets the matches for each file and accumulates the count in a single thread.
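As a minimal sketch of accumulating the counts in a single thread: each task only reports which strings it found in one file, and the submitting thread alone updates the totals. To keep the example self-contained, it assumes the file contents are passed in as in-memory strings; the names SingleThreadAccumulation, matchesIn and countMatches are hypothetical and not from the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadAccumulation {

    // Hypothetical helper: the patterns that occur at least once in one file's text.
    static Set<String> matchesIn(String fileText, List<String> patterns) {
        Set<String> found = new HashSet<>();
        for (String p : patterns) {
            if (fileText.contains(p)) {
                found.add(p);
            }
        }
        return found;
    }

    // Searches every file in the pool, then sums the per-file results on the calling thread.
    static Map<String, Integer> countMatches(List<String> fileTexts, List<String> patterns)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Set<String>>> futures = new ArrayList<>();
            for (String text : fileTexts) {
                Callable<Set<String>> task = () -> matchesIn(text, patterns);
                futures.add(pool.submit(task));
            }
            // Only this thread touches the map, so a plain HashMap is enough.
            Map<String, Integer> counts = new HashMap<>();
            for (String p : patterns) {
                counts.put(p, 0);
            }
            for (Future<Set<String>> f : futures) {
                for (String p : f.get()) {   // get() blocks until that file is done
                    counts.merge(p, 1, Integer::sum);
                }
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = countMatches(
                List.of("abcd abcd", "xyz", "abcd xyz"),
                List.of("abcd", "xyz"));
        System.out.println(counts.get("abcd")); // prints 2: one count per file containing it
    }
}
```

Because the worker tasks never write to shared state, no synchronization is needed for the counts; Future.get() also rethrows any exception a task hit, so failures are not silently lost.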
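These lines (the mTable.get(currentString) + 1 followed by mTable.put(currentString,count) in run()) are not thread safe. ConcurrentHashMap makes each call safe on its own, but the get-then-put pair is not atomic: any number of threads can be updating the same key at once, so increments can be lost and the counts will not be reliable.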
A simple workaround is to store AtomicInteger values in the map instead of Integer (it will also simplify your code).
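A minimal sketch of the AtomicInteger approach, assuming the map is changed to ConcurrentHashMap<String, AtomicInteger> and every key is inserted before the tasks start. The class AtomicCountDemo and the method countWithAtomic are hypothetical names used only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCountDemo {

    // Runs `tasks` concurrent increments of a single key and returns the final count.
    static int countWithAtomic(int tasks) throws InterruptedException {
        ConcurrentHashMap<String, AtomicInteger> table = new ConcurrentHashMap<>();
        table.put("abcd", new AtomicInteger(0)); // keys are added before any task runs

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < tasks; i++) {
            // incrementAndGet() is one atomic read-modify-write, replacing get() + put().
            pool.execute(() -> table.get("abcd").incrementAndGet());
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return table.get("abcd").get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWithAtomic(10_000)); // prints 10000
    }
}
```

In the Search task, the two lines inside the if block would then collapse to mTable.get(currentString).incrementAndGet(), with mTable declared as ConcurrentHashMap<String, AtomicInteger>.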