Does anyone know of an optimized way of detecting a 37-bit sequence in a chunk of binary data? Sure, I can do a brute-force compare using windowing (compare the 37 bits starting at index 0, then increment and loop until I find it), but is there a better way? Maybe some hashing search that returns a probability that the sequence lies within a binary chunk? Or am I just pulling that out of my butt? Anyway, I'm going ahead with the brute-force search, but I was curious whether there is something more efficient. This is in C, by the way.
Interesting question. I assume your 37-bit sequence can begin at any point in a byte. Let’s say your sequence is represented by this:
If we have a byte-aligned algorithm, we could see these 32-bit sequences:
Only these byte values, and no others, could form the second, third and fourth bytes of a byte sequence containing the 37 bits of interest.
This leads to a reasonably obvious implementation:
    unsigned char *p = ...; // input data
    size_t n = ...;         // bytes available
    size_t bitpos;

    --n; p++;
    bitpos = 0;
    while (n--) {
        uint32_t word = *(uint32_t*)p; // nonportable, sorry.
        bitpos += 8; // compiler should be able to optimise this variable out completely
        if (word == w_A) {
            // note the parentheses: == binds tighter than & in C
            if (((p[4] & 0xF0) == 789@) && ((p[-1] & 1) == A)) {
                // we found the data starting at the 8th bit of p-1
                found_at(bitpos-1);
            }
        } else if (word == w_B) {
            if (((p[4] & 0xE0) == 89@) && ((p[-1] & 3) == AB)) {
                // we found the data starting at the 7th bit of p-1
                found_at(bitpos-2);
            }
        } else if (word == w_C) {
            ...
        }
        ...
        ++p; // advance one byte per iteration
    }

Obviously there are problems with this strategy. First, it might want to evaluate p[-1] the first time around the loop, but that's easy to fix. Second, it fetches a word from odd addresses; that won't work on some CPUs, SPARC and 68k for example. But doing so is an easy way to roll 4 comparisons into one.
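For comparison, here is a minimal portable sketch that sidesteps the unaligned fetch. It is not the word-matching scheme above but a plain sliding window: each input byte is shifted into a 64-bit register, and the 8 possible bit alignments ending in that byte are tested with one mask-and-compare each. The function name `find_bits37`, the MSB-first bit numbering, and the convention of passing the pattern in the low 37 bits of a `uint64_t` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PATTERN_BITS 37
#define PATTERN_MASK ((UINT64_C(1) << PATTERN_BITS) - 1)

/* Hypothetical helper: returns the bit offset of the first occurrence of
 * the 37-bit pattern, or -1 if it is absent. Bit 0 is the most significant
 * bit of data[0]; the pattern is taken from the low 37 bits of `pattern`,
 * with its first bit in bit 36. */
long long find_bits37(const unsigned char *data, size_t n, uint64_t pattern)
{
    uint64_t window = 0; /* holds the most recent 8 bytes of input */

    pattern &= PATTERN_MASK;
    for (size_t i = 0; i < n; ++i) {
        window = (window << 8) | data[i];

        /* A candidate may end 7, 6, ..., 0 bits before the end of byte i.
         * Checking shift 7 first keeps the reported match the earliest. */
        for (int shift = 7; shift >= 0; --shift) {
            size_t bits_seen = (i + 1) * 8 - (size_t)shift;

            /* Ignore candidates that would start before the buffer. */
            if (bits_seen >= PATTERN_BITS &&
                ((window >> shift) & PATTERN_MASK) == pattern) {
                return (long long)(bits_seen - PATTERN_BITS);
            }
        }
    }
    return -1;
}
```

This only ever reads single bytes, so alignment is not an issue, and it needs no look-behind or look-ahead bounds checks. Whether it beats the word-compare loop depends on the CPU and compiler, so it is worth measuring both.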
kek444's suggestion would allow you to use an algorithm like KMP to skip forward in the data stream. However, the maximum size of the skip is not huge, so while the Turbo Boyer-Moore algorithm may reduce the number of byte comparisons by 4 or so, that may not be much of a win if the cost of a byte comparison is similar to the cost of a word comparison.