I’m writing a small log-scraping program in Python that processes a rolling log file and stores the file offsets of lines of interest.
My original solution ran quickly on large files, but I had no way of clearing the storage, so memory usage would grow steadily for as long as the program kept running, until it consumed all of the memory available to it. My fix was to use collections.deque with maxlen set, so that the list operates as a circular buffer and discards the oldest log lines as new ones come in.
While this fixes the memory issue, I now face a major performance loss when accessing items in the deque by index. As an example, this code runs much slower than the old equivalent, where self.loglines was not a deque. Is there a way to improve its speed, or to make a circular buffer where random access is a constant-time operation (instead of, I assume, O(n))?
    def get_logline(self, lineno):
        """Gets the logline at the given line number.

        Arguments:
        lineno - The index of the logline to retrieve

        Returns: A string containing the logline at the given line number
        """
        start = self.loglines[lineno].start
        end = self.loglines[lineno+1].start
        size = end - start
        if self._logfile.closed:
            self._logfile = open(self.logpath, "r")
        self._logfile.seek(start)
        logline = self._logfile.read(size)
        return logline
As with all doubly linked lists, random access in a collections.deque is O(n): indexing is fast at either end but slows down toward the middle. Consider using a list of bounded lists so that clearing old entries (del outer[0]) can still proceed in a timely manner even with hundreds of thousands of entries.
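A minimal sketch of the list-of-bounded-lists idea, under some assumptions: the class name ChunkedBuffer and the parameters chunk_size and max_chunks are made up for illustration, and old entries are only ever dropped a whole chunk at a time. Because every chunk except the last is always full, an index maps to a chunk and an offset with a single divmod, so lookup does not depend on how many entries are stored.

```python
class ChunkedBuffer:
    """Bounded sequence stored as a list of fixed-size lists (hypothetical helper)."""

    def __init__(self, chunk_size, max_chunks):
        if chunk_size < 1 or max_chunks < 1:
            raise ValueError("chunk_size and max_chunks must be positive")
        self._chunk_size = chunk_size
        self._max_chunks = max_chunks
        self._chunks = [[]]  # outer list; only the last chunk may be partly filled

    def append(self, item):
        if len(self._chunks[-1]) == self._chunk_size:
            # Last chunk is full: start a new one and drop the oldest if over the limit.
            self._chunks.append([])
            if len(self._chunks) > self._max_chunks:
                del self._chunks[0]  # cost grows with the number of chunks, not entries
        self._chunks[-1].append(item)

    def __len__(self):
        return (len(self._chunks) - 1) * self._chunk_size + len(self._chunks[-1])

    def __getitem__(self, index):
        size = len(self)
        if index < 0:
            index += size
        if not 0 <= index < size:
            raise IndexError("ChunkedBuffer index out of range")
        # All chunks before the last are full, so the position follows from divmod.
        chunk, offset = divmod(index, self._chunk_size)
        return self._chunks[chunk][offset]


buf = ChunkedBuffer(chunk_size=3, max_chunks=2)
for n in range(10):
    buf.append(n)
print(len(buf), buf[0], buf[-1])
```

As with a deque that has maxlen set, indices shift whenever old entries are discarded, here by chunk_size at a time. The buffer holds between (max_chunks - 1) * chunk_size + 1 and max_chunks * chunk_size entries once it has filled up, so the two parameters would need tuning to the memory budget.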