I run performance tests for a few Java applications. The applications produce very large log files (7-10 GB) during a test, and I need to trim these log files to the entries between specific dates and times. Currently I use a Python script that parses each log timestamp into a Python datetime object and prints only the matching lines. But this solution is very slow: a 5 GB log takes about 25 minutes to parse.
Obviously the entries in the log file are in sequential order, so I shouldn’t need to read the whole file line by line.
I thought about reading the file from the start and from the end until the condition is matched, and then printing the lines between the two matched positions. But I don’t know how to read a file backwards without loading it into memory.
Can you please suggest a suitable solution for this case?
Here is part of the Python script:
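from datetime import datetime

# str and end are the datetime bounds, defined elsewhere in the script
lfmt = '%Y-%m-%d %H:%M:%S'
file = open(filename, 'rU')
normal_line = ''
for line in file:
    if line[0] == '[':
        # lines look like "[YYYY-MM-DD HH:MM:SS...": parse the timestamp
        ltimestamp = datetime.strptime(line[1:20], lfmt)
        if ltimestamp >= str and ltimestamp <= end:
            normal_line = 'True'
        else:
            normal_line = ''
    # lines without a timestamp inherit the flag of the previous entry
    if normal_line:
        print line,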
As the data is sequential, if both the start and the end of the region of interest are near the beginning of the file, then reading from the end of the file (to find the matching end point) is still a bad solution!
I’ve written some code that will quickly find the start and end points as you require. This approach is called binary search, and it is similar to the classic children’s “higher or lower” guessing game!
The script reads a trial line mid-way between lower_bounds and upper_bounds (initially the SOF and EOF) and checks the match criteria. If the sought line is earlier, it guesses again by reading a line half-way between the lower bound and the previous trial read (if the sought line is later, it splits between its guess and the upper bound). So you keep iterating between the upper and lower bounds, halving the search range each time.
This should be a really quick solution: the number of reads grows as log to the base 2 of the number of lines! For example, in the worst case, finding one line among 1,000 lines takes about 10 line reads, and finding one among a billion lines takes only about 30.
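To make the idea concrete, here is a minimal Python 3 sketch of the search described above. It is not the original script: the helper names (`next_stamped_line`, `find_offset`, `trim_log`) and the sample data are made up for illustration. It assumes every log entry starts with `[YYYY-MM-DD HH:MM:SS`, as in the question, that entries are in timestamp order, and that lines without a timestamp belong to the entry before them. It bisects over byte offsets rather than line numbers, so no line index is needed.

```python
import io
from datetime import datetime

LFMT = '%Y-%m-%d %H:%M:%S'  # timestamp layout taken from the question


def next_stamped_line(f, pos, size):
    """Return (offset, timestamp) of the first timestamped line starting
    at or after byte `pos`, or (size, None) if there is none."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()          # move to the next line start at or after pos
    else:
        f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            return size, None
        if line.startswith(b'['):
            try:
                stamp = line[1:20].decode('ascii', 'replace')
                return offset, datetime.strptime(stamp, LFMT)
            except ValueError:
                pass          # not a timestamp: treat as a continuation line


def find_offset(f, size, is_before):
    """Binary search for the offset of the first timestamped line whose
    timestamp does NOT satisfy is_before (size if every line does)."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        _, ts = next_stamped_line(f, mid, size)
        if ts is not None and is_before(ts):
            lo = mid + 1      # the sought line is later in the file
        else:
            hi = mid          # the sought line is here or earlier
    return next_stamped_line(f, lo, size)[0]


def trim_log(f, start, end, out, chunk_size=1 << 20):
    """Copy entries with start <= timestamp <= end from f to out.
    Both f and out are assumed to be binary file objects."""
    size = f.seek(0, io.SEEK_END)
    begin = find_offset(f, size, lambda ts: ts < start)
    stop = find_offset(f, size, lambda ts: ts <= end)
    f.seek(begin)
    remaining = stop - begin
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            break
        out.write(chunk)
        remaining -= len(chunk)


# Tiny in-memory demonstration with made-up log lines.
sample = (b"[2020-01-01 00:00:00] a\n"
          b"[2020-01-01 00:00:01] b\n"
          b"  continuation of b\n"
          b"[2020-01-01 00:00:02] c\n"
          b"[2020-01-01 00:00:03] d\n")
out = io.BytesIO()
trim_log(io.BytesIO(sample),
         datetime(2020, 1, 1, 0, 0, 1), datetime(2020, 1, 1, 0, 0, 2), out)
print(out.getvalue().decode('ascii'), end='')
```

For a real log you would presumably open the file with `open(filename, 'rb')` and pass `sys.stdout.buffer` (or an output file opened in binary mode) as `out`. Binary mode is used because `tell()` on a Python 3 text file returns an opaque value rather than a plain byte offset.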
Assumptions for the below code:
Further:
file.seek() is superior), thanks to TerryE and J.F. Sebastian for pointing that out.
import datetime