I’m interested in building a Python script that reports how many lines per interval (maybe per minute) are being written to a file. I have files that are written as data comes in, with a new line for each user that passes data through the external program. Knowing how many lines arrive per interval gives me a metric that I can use for future expansion planning. The output file(s) consist of lines that are all roughly the same length and all end with a line return. I was thinking of writing a script that measures the length of the file at one point in time, measures it again at a later point, and subtracts the two to get my result. However, I don’t know if this is ideal, since it takes time to measure the length of the file and that may skew my results. Does anyone have any other ideas?
Based on what people are saying, I threw this together to start:
import time
from daemon import runner

inputfilename = "/home/data/testdata.txt"

class App():
    def __init__(self):
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'
        self.pidfile_path = '/tmp/count.pid'
        self.pidfile_timeout = 5

    def run(self):
        while True:
            count = 0
            # Read the file in 8 MB binary chunks and count newline bytes.
            with open(inputfilename, 'rb') as filein:
                while True:
                    chunk = filein.read(8192 * 1024)
                    if not chunk:
                        break
                    count += chunk.count(b'\n')
            print(count)
            # set the sleep time for repeated action here:
            time.sleep(60)

app = App()
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()
It gets the count every 60 seconds and prints it to the screen. My next step is the math, I guess.
One more edit: I’ve added output of the number of new lines in each one-minute interval:
import time
from daemon import runner

inputfilename = "/home/data/testdata.txt"

class App():
    def __init__(self):
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'
        self.pidfile_path = '/tmp/twitter_counter.pid'
        self.pidfile_timeout = 5

    def run(self):
        counter1 = 0  # total line count seen on the previous pass
        while True:
            count = 0
            # Read the file in 8 MB binary chunks and count newline bytes.
            with open(inputfilename, 'rb') as filein:
                while True:
                    chunk = filein.read(8192 * 1024)
                    if not chunk:
                        break
                    count += chunk.count(b'\n')
            # Lines added since the previous pass.
            print(count - counter1)
            counter1 = count
            # set the sleep time for repeated action here:
            time.sleep(60)

app = App()
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()
To comment on your idea (which seems pretty sound to me), how accurate do you need the measurement to be?
I’d suggest measuring how long a single measurement takes first. Then, given the relative accuracy you want to achieve, you can calculate the minimum time interval between consecutive measurements. For example, if a measurement takes t milliseconds and you want 1% accuracy, don’t measure more often than once every 100t ms.
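That rule of thumb could be sketched roughly like this. It is only an illustration, assuming Python 3; `count_lines`, `timed_count` and `min_interval` are names made up for this example, not part of the script in the question.

```python
import time

def count_lines(path, chunk_size=8 * 1024 * 1024):
    """Count newline bytes in the file by reading it in binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

def timed_count(path):
    """Return (line_count, seconds_taken) for one measurement."""
    start = time.perf_counter()
    count = count_lines(path)
    return count, time.perf_counter() - start

def min_interval(measure_seconds, relative_error=0.01):
    """Shortest interval for which one measurement stays within the error budget."""
    return measure_seconds / relative_error
```

For example, `count, t = timed_count("/home/data/testdata.txt")` followed by `min_interval(t)` gives the shortest sampling interval, in seconds, that keeps the measurement time at or below 1% of the interval.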
Keep in mind, though, that the measurement time will grow as the file grows.
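One possible way around that growth, which is my own suggestion rather than something in the question's script, is to remember how far into the file you have already counted and read only the bytes appended since then. A minimal sketch, assuming Python 3 and a file that is only ever appended to (the hypothetical `IncrementalLineCounter` class starts over if the file shrinks):

```python
import os

class IncrementalLineCounter:
    """Counts only the lines appended since the previous call."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position up to which lines were already counted

    def new_lines(self):
        """Return the number of newline bytes written since the last call."""
        # If the file shrank (e.g. it was truncated or rotated), start over.
        if os.path.getsize(self.path) < self.offset:
            self.offset = 0
        count = 0
        with open(self.path, 'rb') as f:
            f.seek(self.offset)
            while True:
                chunk = f.read(1024 * 1024)
                if not chunk:
                    break
                count += chunk.count(b'\n')
            # Remember where we stopped so the next call resumes from here.
            self.offset = f.tell()
        return count
```

Calling `new_lines()` once per interval returns the per-interval count directly, and each call costs time proportional to the new data rather than to the whole file.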
Hint on how to count the lines in a file: see the question “Is there a built-in Python analog to Unix 'wc' for sniffing a file?”
Hint on how to measure time: the time module.
P.S. I just tried timing the line counter on a 245M file. The first time it took about 10 seconds (I didn’t actually time the first run), but after that it was always below 1s. Maybe some caching is done there, I’m not sure.