I have a complex regex, and I’d like to match it with the contents of an entire huge file. The main concern is efficiency, since the file is indeed very big and running out of memory is a distinct possibility.
Is there a way I can somehow “buffer” the contents while pumping it through a regex matcher?
Yes, Pattern.matcher() will take a CharSequence. If your input is already in a charset which uses exactly 2 bytes to represent a character without any ‘prologue’ (such as a byte-order mark), you need only:
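For illustration, a minimal sketch of that case, assuming the file is UTF-16BE with no byte-order mark and fits in a single mapping (under 2 GB). The class name MappedUtf16Search and the method countMatches are made up for this example:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from any library.
final class MappedUtf16Search {
    /** Counts matches of the pattern in a file assumed to be UTF-16BE without a BOM. */
    static int countMatches(Path file, Pattern pattern) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping cannot exceed Integer.MAX_VALUE bytes.
            MappedByteBuffer bytes = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            bytes.order(ByteOrder.BIG_ENDIAN);       // the default, stated for clarity
            CharBuffer chars = bytes.asCharBuffer(); // a view: no copy, no decoding
            Matcher m = pattern.matcher(chars);
            int count = 0;
            while (m.find()) {
                count++;
            }
            return count;
        }
    }
}
```

The mapping stays valid after the channel is closed, but here the matching is done inside the try block anyway to keep the lifetime obvious.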
… and since CharBuffer implements CharSequence, you’re done.
On the other hand, if you need to decode the bytes from some other charset, you’ll have your work cut out, since CharBuffer is charset-agnostic, and CharsetDecoder.decode(ByteBuffer) internally allocates a new CharBuffer roughly the same size as the input bytes.
Whether or not you’ll be able to get away with a smaller buffer depends a fair bit on your regex and what you want to do with the match results. But the basic approach would be to implement CharSequence and wrap the memory-mapped ByteBuffer, a smaller CharBuffer for ‘working space’, and a CharsetDecoder. You’ll use CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) to decode the bytes ‘on demand’, and hope that the general direction of the regex matcher is ‘forward’, and that the input you’re interested in comes in fairly small chunks.
As a rough start:
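The following is only a sketch of that idea, not a tuned implementation. The class name DecodingCharSequence and the window size parameter are assumptions for illustration. It decodes forward into a small reusable window, re-decodes from the start whenever the matcher reads behind the window, and needs one full streaming pass to compute length() for variable-width charsets:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical class: a CharSequence view that decodes bytes on demand.
final class DecodingCharSequence implements CharSequence {
    private final ByteBuffer source;   // e.g. a MappedByteBuffer
    private final Charset charset;
    private final CharBuffer window;   // small, reused working space
    private ByteBuffer in;             // read cursor over source
    private CharsetDecoder decoder;
    private int windowStart;           // index of the window's first char
    private int windowLen;             // chars currently in the window
    private boolean drained;           // all input bytes have been decoded
    private boolean flushed;           // decoder has been fully flushed
    private int length = -1;           // computed lazily

    DecodingCharSequence(ByteBuffer source, Charset charset, int windowChars) {
        if (windowChars < 2) throw new IllegalArgumentException("window too small");
        this.source = source;
        this.charset = charset;
        this.window = CharBuffer.allocate(windowChars);
        restart();
    }

    /** Resets decoding to the first byte of the input. */
    private void restart() {
        in = source.duplicate();       // independent position, shared content
        decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        windowStart = 0;
        windowLen = 0;
        drained = false;
        flushed = false;
    }

    /** Decodes the next window of chars; returns false at end of input. */
    private boolean advance() {
        windowStart += windowLen;
        window.clear();
        // The whole input is already in the buffer, so endOfInput is always true.
        if (!drained) drained = decoder.decode(in, window, true).isUnderflow();
        if (drained && !flushed) flushed = decoder.flush(window).isUnderflow();
        window.flip();
        windowLen = window.remaining();
        return windowLen > 0;
    }

    @Override public char charAt(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index " + index);
        if (index < windowStart) restart();   // backward read: start over (slow)
        while (index >= windowStart + windowLen) {
            if (!advance()) throw new IndexOutOfBoundsException("index " + index);
        }
        return window.get(index - windowStart);
    }

    @Override public int length() {
        if (length < 0) {
            // One streaming pass just to count chars; int limits this to 2^31 - 1.
            restart();
            int n = 0;
            while (advance()) n += windowLen;
            length = n;
            restart();
        }
        return length;
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end < start) throw new IndexOutOfBoundsException();
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) sb.append(charAt(i));
        return sb;
    }

    @Override public String toString() {
        return subSequence(0, length()).toString();
    }
}
```

Restarting from the first byte on every backward read is the obvious weak spot. Remembering byte offsets at window boundaries would be one way to soften it, at the cost of more bookkeeping.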
It might be instructive to wrap a fully-decoded CharBuffer in a much simpler CharSequence wrapper of your own, and log how the methods are actually called for your given input, when you run it with a big heap on your development box. That will give you an idea if this approach is going to work for your particular scenario.
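Such a wrapper could look roughly like the sketch below. The class LoggingCharSequence and its counters are hypothetical names, and it records simple statistics instead of printing every call. The most useful number is probably how far behind the furthest index already read the matcher ever goes, since that suggests how large a working window would have to be:

```java
// Hypothetical diagnostic wrapper: counts how a regex Matcher accesses its input.
final class LoggingCharSequence implements CharSequence {
    private final CharSequence delegate;
    private int highWater = -1;   // furthest index read so far
    int charAtCalls;              // total charAt invocations
    int subSequenceCalls;         // total subSequence invocations
    int maxLookBehind;            // largest distance read behind highWater

    LoggingCharSequence(CharSequence delegate) {
        this.delegate = delegate;
    }

    @Override public char charAt(int index) {
        charAtCalls++;
        if (index > highWater) {
            highWater = index;
        } else {
            maxLookBehind = Math.max(maxLookBehind, highWater - index);
        }
        return delegate.charAt(index);
    }

    @Override public int length() {
        return delegate.length();
    }

    @Override public CharSequence subSequence(int start, int end) {
        subSequenceCalls++;
        return delegate.subSequence(start, end);
    }

    @Override public String toString() {
        return delegate.toString();
    }

    /** One-line summary suitable for a log statement. */
    String summary() {
        return "charAt=" + charAtCalls
                + " subSequence=" + subSequenceCalls
                + " maxLookBehind=" + maxLookBehind;
    }
}
```

To try it, wrap the decoded input, for example `new LoggingCharSequence(CharBuffer.wrap(text))`, pass it to `pattern.matcher(...)`, run the usual `find()` loop, and print `summary()` at the end. A small maxLookBehind relative to the intended window size would be an encouraging sign, though the numbers depend entirely on the pattern and the input.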