I’m reading in a file which I can’t buffer all at once, as its size ranges from 256 MB up to ~2 GB.
Once the file is opened, I read a chunk of it into a byte array, say 512 bytes, convert it into a String and run a regex over it, and if the pattern is detected, my program makes a note of it.
The problem I’m having is that my program is missing a lot of the locations in the file where it should be detecting the pattern.
I’m 90% sure the problem is that whilst the pattern is there, it’s not complete, as it’s going outside the length of the buffer. The pattern I’m looking for is eight bytes in length, so for example, the first four bytes of the pattern are at the last four positions in the array; so when it is filled again, the first four bytes of the array are the last four of the pattern. As such, my regex always fails.
I’m guessing what I need to do is fill the buffer and then, each time I refill it, retain the last 20 or so bytes, so that I don’t miss any of the patterns I’m looking for.
Any advice would be greatly appreciated. Thanks in advance.
Tony
First of all, you can’t apply a Java regular expression to a byte array. You have to apply it to a String. So you must be converting from byte[] to String, and you may be (a) using the wrong encoding or (b) truncating the string in the middle.
Once you’ve gotten past this, you need to use a streaming discipline to reframe what you read. I can describe one that might or might not apply:
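Here is a minimal sketch of one such discipline, not a definitive implementation: keep the last matchLen - 1 bytes of every chunk and put them at the front of the next one, so a pattern split across two reads is seen whole in the second window. The names ChunkedByteScanner and findOffsets and their parameters are hypothetical. The sketch assumes every match is exactly matchLen bytes long (8 in the question), and it decodes with ISO-8859-1 purely because that charset maps each byte to exactly one char, so match positions equal byte offsets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans a stream chunk by chunk without losing
// matches that straddle a chunk boundary.
final class ChunkedByteScanner {

    // Returns the absolute byte offset of every match.
    // Assumption: every match is exactly matchLen bytes long.
    static List<Long> findOffsets(InputStream in, Pattern pattern,
                                  int matchLen, int chunkSize) throws IOException {
        int overlap = matchLen - 1;            // longest prefix a chunk can end with
        byte[] buf = new byte[overlap + chunkSize];
        List<Long> offsets = new ArrayList<>();
        int carried = 0;                       // bytes kept from the previous window
        long base = 0;                         // file offset of buf[0]
        int n;
        while ((n = in.read(buf, carried, buf.length - carried)) > 0) {
            int len = carried + n;
            // ISO-8859-1 turns each byte into one char (U+0000..U+00FF).
            String window = new String(buf, 0, len, StandardCharsets.ISO_8859_1);
            Matcher m = pattern.matcher(window);
            while (m.find()) {
                offsets.add(base + m.start());
            }
            // Move the tail to the front. It is shorter than a full match,
            // so no match can be reported twice.
            int keep = Math.min(overlap, len);
            System.arraycopy(buf, len - keep, buf, 0, keep);
            base += len - keep;
            carried = keep;
        }
        return offsets;
    }
}
```

With this decoding, the regex has to describe bytes as chars in the range \x00 to \xFF (Java regexes accept \xhh escapes), and Pattern.DOTALL is needed if the pattern uses a dot that should also match line-terminator bytes.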
If this is really an ordinary file of characters, then modify as follows:
Then apply the algorithm above to avoid buffer boundary conditions.
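For the ordinary-text case, a rough sketch under the same assumptions: let a Reader do the decoding, so that a multi-byte character is never cut in half at a chunk edge, and carry over the last matchLen - 1 chars instead of bytes. ChunkedCharScanner and findOffsets are hypothetical names, and the offsets returned are char positions (UTF-16 code units), not byte positions.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: same carry-over loop, but over decoded characters.
final class ChunkedCharScanner {

    // Returns the char offset of every match.
    // Assumption: every match is exactly matchLen chars long.
    static List<Long> findOffsets(Reader reader, Pattern pattern,
                                  int matchLen, int chunkSize) throws IOException {
        int overlap = matchLen - 1;
        char[] buf = new char[overlap + chunkSize];
        List<Long> offsets = new ArrayList<>();
        int carried = 0;                       // chars kept from the previous window
        long base = 0;                         // char offset of buf[0]
        int n;
        while ((n = reader.read(buf, carried, buf.length - carried)) > 0) {
            int len = carried + n;
            Matcher m = pattern.matcher(new String(buf, 0, len));
            while (m.find()) {
                offsets.add(base + m.start());
            }
            // Keep the tail so a match split across reads is seen next time.
            int keep = Math.min(overlap, len);
            System.arraycopy(buf, len - keep, buf, 0, keep);
            base += len - keep;
            carried = keep;
        }
        return offsets;
    }
}
```

A reader for this could come from Files.newBufferedReader(path, StandardCharsets.UTF_8); the charset there is an assumption and has to match how the file was actually written.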