I need to scan a 300MB text file with a regex.
- Reading the whole file into a variable uses over 700MB of RAM and then fails with a “cannot allocate memory” error.
- A match can span two or three lines, so I cannot step through the file line by line in a loop.
Is there any lazy method to do a full file scan with a regex without reading it into a separate variable?
UPD
Done. Now you can use this function to read by chunks.
Modify it for your goals.
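def prepare_session_hash(fname, regex_string, start = 0)
  # The pattern needs two capture groups: match[0] is the key, match[1] the value.
  regex = Regexp.new(regex_string)
  @session_login_hash = {}
  File.open(fname, 'rb') do |f|
    fsize = f.size
    overlap = 200                         # bytes re-read so a match split across two chunks is still seen
    bsize = [fsize / 8, overlap * 2].max  # chunk size; kept above overlap so every read moves forward
    f.seek(start) if start > 0
    loop do
      # Step back by `overlap` bytes unless we are near the start or already at the end.
      f.seek(f.tell - overlap) if f.tell >= overlap && f.tell < fsize
      buffer = f.read(bsize)              # returns nil at end of file
      break unless buffer
      buffer.scan(regex) do |match|
        @session_login_hash[match[0]] = match[1]
      end
    end
  end
  @session_login_hash
end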
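If you need the matches themselves rather than a hash, the same chunk-and-overlap idea can be sketched as a block-yielding helper. This is only a sketch under assumptions: `each_match_in_chunks`, `chunk_size` and `overlap` are names made up here, the pattern is assumed to be ASCII-only (the file is read as binary), and no match is assumed to be much longer than `overlap` bytes. A match that ends too close to the chunk edge is held back and retried once more data has been read, so nothing is reported twice or cut short.

```ruby
# Hypothetical helper, not part of the original post.
# Yields each matched string while holding at most about one chunk
# plus the carried-over tail in memory.
def each_match_in_chunks(fname, regex, chunk_size: 1 << 20, overlap: 200)
  File.open(fname, 'rb') do |f|
    carry = ''.b                          # unprocessed tail of the previous buffer
    loop do
      chunk = f.read(chunk_size)          # nil once the end of file is reached
      buffer = chunk ? carry + chunk : carry
      # A match ending past safe_limit may have been cut off by the chunk
      # edge, so it is deferred to the next pass (except at end of file).
      safe_limit = chunk ? [buffer.bytesize - overlap, 0].max : buffer.bytesize
      keep_from = safe_limit
      pos = 0
      while (m = regex.match(buffer, pos))
        if m.end(0) > safe_limit
          # Keep the whole deferred match in the carried-over tail.
          keep_from = [m.begin(0), safe_limit].min
          break
        end
        yield m[0]
        # Always advance, even after a zero-length match.
        pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
      end
      break unless chunk
      carry = buffer.byteslice(keep_from, buffer.bytesize)
    end
  end
end
```

Usage would look like `each_match_in_chunks('big.log', /\d+/) { |s| puts s }`, where `big.log` is a placeholder file name.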
Example:
In this text, assume the desired pattern is numeric strings, e.g. /\d+/m (match digits, multiline). Then, instead of loading and processing the whole file, you can choose a chunk-delimiting pattern, say a full stop (.) in this case, and only read and process up to that delimiter before moving on to the next chunk.
CHUNK#1:
CHUNK#2:
and so on.
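In Ruby, one way to sketch this delimiter-based chunking is `File.foreach` with a custom separator, which hands over one delimiter-terminated record at a time instead of the whole file. The helper name `scan_by_delimiter` and the `delimiter` parameter are made up for illustration. It assumes the delimiter occurs often enough that each record fits comfortably in memory, and that no match ever contains the delimiter itself (a pattern like /\d+\.\d+/ would be split and missed).

```ruby
# Hypothetical helper, not part of the original answer.
# Reads the file record by record, where each record ends at `delimiter`,
# and collects every regex match found inside each record.
def scan_by_delimiter(fname, regex, delimiter: '.')
  matches = []
  File.foreach(fname, delimiter) do |record|
    # Each record still ends with the delimiter (except possibly the last).
    matches.concat(record.scan(regex))
  end
  matches
end
```

A record may contain newlines, so a pattern that spans two or three lines still works as long as it stays inside one record.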
EDIT:
Adding @Ranty’s suggestion from the comments as well: