I have a ~700 MB binary file (non-text data); what I would like to do is search for a specific pattern of bytes that occurs in random locations throughout the file. e.g. 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 0x? 0x? 0x55 and so on for 50 or so bytes in sequence. The pattern I’d be searching for would be a sequence two random bytes with 0x55 occurring every two bytes.
That is, search for tables stored in the file with 0x55 being the delimiter, and then save the data contained in the tables or otherwise manipulate it.
Would the best option be simply going through every individual byte one at a time, and then looking ahead two bytes to see if the value is 0x55, and if it is, then looking ahead again and again to confirm that a table exists in that location?
Load the whole thing? fseek? Buffer chunks, searching those one byte at a time?
What would be the best way of looking through this large file, and finding the pattern, using C or C++?
What ultimately worked for me was a hybrid between the Boyer-Moore-Horspool algorithm (suggested by Jerry Coffin) and my own algorithm based on the structure of the tables and the data being stored.
Basically, the BMH algorithm caught most of the things I was looking for. The obvious stuff.
But some tables did turn out to have odd formatting, and I had to implement a semi-intelligent search that would look at the data following each
0x55, and figure out whether or not it was it was likely to be good data, or just random junk.Oddly enough, I ended up implementing it in PHP rather than C++, and dumping the results right into a MySQL database for querying. The search process only took around 5 minutes or less, and the results were largely good. I did end up with a lot of junk data, but it caught everything that I needed it to, and (as far as I’m aware) did not leave any good data behind.