I’m looking for an efficient way to search a file for a pattern of bytes through a buffer without reading any part of the file twice. I’ve chosen to implement Runnable so I can split the task across concurrent threads. My code looks something like this:
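// Constructor: initializes the instance fields.
public BytePatternSearcher(RandomAccessFile raFile, byte[] pattern, int bufferSize, long startPos, long endPos);

public void run()
{
    int bytesRead;
    // read() returns -1 at end of file, so check it before subtracting (IOException handling omitted)
    while (amountToRead > 0 && (bytesRead = raFile.read(buffer)) > 0)
    {
        amountToRead -= bytesRead;
        // search code
    }
}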
Now I’ve hit a snag: my algorithm works in simple cases but not in complex ones. I assumed that a match never starts inside another candidate that is still being checked, that the pattern is shorter than the buffer, and so on, so the code tracks only one candidate match at a time while iterating through the file. Naturally, this is not a very robust solution. Suppose the pattern is ‘xxxxx’ (length 5), the file is ‘xxxxxxyxxxxxx’ and the buffer size is 2 (x and y stand for particular byte values). The pattern occurs 4 times (at offsets 0, 1, 7 and 8), the occurrences overlap, and each one spans more than twice the buffer length.
How can I handle all of these cases without checking the same byte more than once?
Wikipedia has an entry on the Boyer–Moore string search algorithm, which also includes some sample implementations.
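Boyer–Moore compares from the end of the pattern backwards, so it is most comfortable when a whole pattern-length window is in memory. For the buffered case in the question, one possible alternative is Knuth–Morris–Pratt (a different algorithm from the one linked above), because the only state it needs between bytes is a single integer, which can survive across buffer refills. The following is a minimal sketch under that assumption, not a definitive implementation; the class and method names (StreamingKmp, feed, matches) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: a KMP matcher whose state carries over between buffers. */
class StreamingKmp {
    private final byte[] pattern;
    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix
    private final int[] failure;
    private int matched = 0;    // number of pattern bytes currently matched
    private long consumed = 0;  // total number of bytes fed so far
    private final List<Long> matches = new ArrayList<>();

    StreamingKmp(byte[] pattern) {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("empty pattern");
        }
        this.pattern = pattern.clone();
        this.failure = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = failure[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            failure[i] = k;
        }
    }

    /** Feeds buffer[0..length) to the matcher; intended to be called once per read(). */
    void feed(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            byte b = buffer[i];
            // On a mismatch, fall back to the longest prefix that can still match.
            while (matched > 0 && b != pattern[matched]) {
                matched = failure[matched - 1];
            }
            if (b == pattern[matched]) {
                matched++;
            }
            consumed++;
            if (matched == pattern.length) {
                // Offset is relative to the first byte ever fed.
                matches.add(consumed - pattern.length);
                // Keep the overlapping part so overlapping occurrences are also found.
                matched = failure[matched - 1];
            }
        }
    }

    List<Long> matches() {
        return matches;
    }
}
```

In run(), each read would be followed by a call such as kmp.feed(buffer, bytesRead), and startPos would be added to each reported offset. The buffer can be smaller than the pattern, and no byte in the buffer is revisited once it has been fed. If the file is split between threads, each thread would presumably have to continue up to pattern.length - 1 bytes past its endPos to catch occurrences that straddle the boundary.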