I need to compare very large, file-based strings of equal length for simple equality, without first calculating a hash.
I want to use the data in the string to make large, seemingly random jumps, so that I can quickly detect inequality even in strings that start and end the same way. That is, I want to jump throughout the range in a way that mostly or completely avoids hitting the same character too many times.
Since the strings are file-based and very large, I don’t want my jumps to be too large because that will thrash the disk.
In my program, a string is simply a sequence of chars backed by a file and less than 2 GB in size, but rarely completely in memory at once.
Then, after trying for a while, I assume they are equal and just iterate in order.
My string class variations all share a base interface of int length() and char charAt() functions, assuming Java chars, which are usually but not always ASCII.
Any ideas?
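To make the idea concrete, here is a rough sketch of the kind of thing I mean. It is only an illustration under assumptions: FileString, ProbeCompare, blockSize and probesPerBlock are made-up names, and a fixed stride that is coprime to the block length stands in for the data-dependent jumps, so that no position in a block is probed twice. The probes stay inside one block at a time to keep disk access local, and the in-order pass is the fallback.

```java
// Hypothetical stand-in for the base interface described above.
interface FileString {
    int length();
    char charAt(int index);
}

final class ProbeCompare {
    // In-memory wrapper, only so the sketch can be tried without a file.
    static FileString of(final String s) {
        return new FileString() {
            public int length() { return s.length(); }
            public char charAt(int index) { return s.charAt(index); }
        };
    }

    static boolean equal(FileString a, FileString b, int blockSize, int probesPerBlock) {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be positive");
        int n = a.length();
        if (n != b.length()) return false;

        // Phase 1: a few scattered probes per block, visiting blocks in order.
        for (long start = 0; start < n; start += blockSize) {
            int len = (int) Math.min(blockSize, n - start);
            int stride = coprimeStride(len);
            int off = 0;
            for (int k = 0; k < probesPerBlock && k < len; k++) {
                int i = (int) (start + off);
                if (a.charAt(i) != b.charAt(i)) return false;
                // The stride is coprime to len, so offsets do not repeat within len steps.
                off = (int) (((long) off + stride) % len);
            }
        }

        // Phase 2: no difference found by probing, so compare everything in order.
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }

    // Picks a stride near 0.618 * len that shares no factor with len.
    static int coprimeStride(int len) {
        int s = Math.max(1, (int) (len * 0.618));
        while (gcd(s, len) != 1) s++;
        return s;
    }

    static int gcd(int x, int y) {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }
}
```

Something like ProbeCompare.equal(a, b, 4096, 8) would then probe eight positions in each 4096-char block before falling back to the full scan. The block size and probe count are guesses that would need tuning against the real files.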
Andy
Build some metadata about your giant strings.
Let’s say you have them split into logical pages or blocks. You pick a block size and when you load a block into memory you hash it, storing this hash in a lookup table.
When you go to compare two files, you can first compare known hashes of subsections before going to disk to get more.
This should give you a good balance of caching and removing the need for disk access, without giving you too much overhead.
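A minimal sketch of that lookup table might look like the following. The names BlockHashes, onBlockLoaded and provablyDifferent are hypothetical, and CRC32 from java.util.zip is just one convenient choice of hash. It assumes both strings use the same block size and that your paging code calls onBlockLoaded whenever a block comes into memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical per-string table of block hashes, filled in lazily as blocks are loaded.
final class BlockHashes {
    private final int blockSize;
    private final Map<Integer, Long> hashes = new HashMap<>();

    BlockHashes(int blockSize) {
        this.blockSize = blockSize;
    }

    // Call this from the code that pages a block into memory.
    void onBlockLoaded(int blockIndex, char[] data, int len) {
        CRC32 crc = new CRC32();
        for (int i = 0; i < len; i++) {
            crc.update(data[i] >>> 8);   // high byte of the char
            crc.update(data[i] & 0xFF);  // low byte of the char
        }
        hashes.put(blockIndex, crc.getValue());
    }

    // True only if some block hashed on both sides has different hashes.
    // A false result proves nothing: the strings still need a real comparison.
    static boolean provablyDifferent(BlockHashes a, BlockHashes b) {
        if (a.blockSize != b.blockSize) {
            throw new IllegalArgumentException("block sizes must match");
        }
        for (Map.Entry<Integer, Long> e : a.hashes.entrySet()) {
            Long other = b.hashes.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return true;
            }
        }
        return false;
    }
}
```

You would call provablyDifferent before touching the disk. If it returns true you are done. If it returns false, matching hashes only suggest the blocks are equal, so you still compare the chars, ideally starting with blocks that have no hash on one side or the other.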