I am connecting to a server that will send me streaming data that needs to be processed on a per-line basis. So I have to parse out the individual lines, then process each line. The following code appears to work just fine, but I am wondering if there are any standard design patterns for doing this type of thing. Or is this the way to go?
Does the Queue introduce any serious overhead? I need it to be as fast and efficient as possible, which is also why I strayed away from libraries like twisted.
import socket, multiprocessing

def receive_proc(s, q):
    data = ''
    while True:
        data += s.recv(4096)
        if '\n' in data:
            lines = data.split('\n')[:-1]
            for line in lines:
                if len(line) > 0:
                    q.put(line)
                data = data.replace(line+'\n', '', 1)

q = multiprocessing.Queue()
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
p = multiprocessing.Process(target=receive_proc, args=(s,q))
p.start()
while True:
    line = q.get()
    # do your processing here
There are certainly valid reasons for wanting to stay away from things like twisted, but I don’t think efficiency is among them – I suspect they’re more likely to be optimised in the right ways. Performance is a tricky beast and often the bottlenecks aren’t really where you thought, which is why you need to profile before you can properly optimise. For example, frameworks may have made the effort to push more of their code out into C extensions which will definitely help performance. If performance is your key motivator, third party stuff is probably the safer option. Also, there’s a big argument for using code that other people have tested and tweaked for all sorts of different use cases and environments – if you end up reinventing too much of the wheel, there’s always the risk it might be missing a few spokes.
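To make the "profile first" advice concrete, here is a rough sketch using the standard library's cProfile and pstats modules. The names process_line, handle_lines and profile_report are placeholders I have made up for illustration; swap in whatever your real per-line work is.

```python
import cProfile
import io
import pstats


def process_line(line):
    # Hypothetical stand-in for the real per-line work.
    return sum(int(field) for field in line.split(","))


def handle_lines(lines):
    # Hypothetical driver: process every line and collect the results.
    return [process_line(line) for line in lines]


def profile_report(lines, top=5):
    """Run handle_lines under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_lines(lines)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling print(profile_report(sample_lines)) on a representative batch of lines should show whether the time goes into your processing or somewhere else, which is worth knowing before deciding whether the Queue matters at all.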
However, what you need to do seems quite simple so the overhead of installing and learning a framework, and also adding another runtime dependency to your code, may not be justified. Also, if you’re primarily IO-bound then burning a bit of extra CPU doing your processing isn’t going to make much difference anyway. I’ve certainly avoided things like twisted at times in the past simply because I knew it would be faster (in terms of my time) to write it myself and performance would be “good enough”. I’ve always found twisted’s system of callbacks makes debugging a little bit tricky – getting access to error messages can be a little fraught, for example. It’s by no means impossible and many people use it very successfully, but personally I found it too “fiddly” to be justified for simple tasks.
I think your idea of splitting receiving and processing into their own processes might be a false economy in this case – receiving data from a socket is extremely fast, and if you’re doing significant amounts of processing in pure Python that’s likely to be the dominant performance factor. However, I can’t say for sure without knowing what processing you’re doing. If it’s going to be time-consuming and/or CPU-intensive, and you can process each line independent of previous lines, then it’s probably reasonable but you’d likely want to farm the processing out to a whole set of worker processes. This is pretty easy based on your existing code – just make the main process the receiver instead of the “slave” and create a pool of workers which all share a Queue. Each worker goes through a loop picking the next item and producing the result. It doesn’t matter how long each takes, they just get the next item as it becomes available (and Queue will handle that for you).
If, however, your processing loop is also primarily IO-bound (e.g. writing to a file) then you might find a single process is actually better than the overhead of pushing everything down a pipe. This depends on many factors including your CPU architecture (some systems make transfers between CPU cores more expensive than others), but ultimately you don’t want to use multiple processes unless you’re pretty confident it’s going to give you a performance win.
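For the CPU-bound case, the worker-pool arrangement described above might look roughly like the sketch below. It assumes Python 3 (so the socket yields bytes), and SENTINEL, process_line, worker, receive_lines and main are names I have invented for illustration, not anything from your code.

```python
import multiprocessing
import socket

SENTINEL = None  # hypothetical marker telling a worker to stop


def process_line(line):
    # Hypothetical stand-in for the real per-line processing.
    return len(line)


def worker(tasks, handle):
    # Each worker loops, taking the next line as soon as one is available.
    while True:
        line = tasks.get()
        if line is SENTINEL:
            break
        handle(line)


def receive_lines(sock, tasks):
    # The main process is the receiver: split the stream into lines and queue them.
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # an empty read means the server closed the connection
            break
        buf += chunk
        # Everything before the last "\n" is complete; the tail stays buffered.
        *lines, buf = buf.split(b"\n")
        for line in lines:
            if line:
                tasks.put(line)


def main(host="127.0.0.1", port=1234, n_workers=4):
    tasks = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=worker, args=(tasks, process_line))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    with socket.create_connection((host, port)) as sock:
        receive_lines(sock, tasks)
    for _ in workers:
        tasks.put(SENTINEL)  # one stop marker per worker
    for w in workers:
        w.join()
```

You would call main() under an if __name__ == "__main__": guard. Note that with several workers the lines are no longer guaranteed to be handled in arrival order, which is why this only suits lines that can be processed independently.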
Anyway, if the loop is IO-bound you might find a single process with non-blocking IO is the way to go. You can use Python’s select module to do this yourself, or you may find it cleaner using a library like eventlet or gevent.
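If you go the do-it-yourself route with select, a single-process version could be sketched as below. Again this assumes Python 3 and bytes; read_lines and its timeout parameter are hypothetical names of mine, and a real program would probably watch more than one descriptor.

```python
import select
import socket


def read_lines(sock, timeout=1.0):
    """Yield complete lines from sock, using select() to wait for data."""
    sock.setblocking(False)
    buf = b""
    while True:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            continue  # timed out; a real loop could do other work here
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:
            continue  # nothing to read after all; wait again
        if not chunk:
            return  # an empty read means the peer closed the connection
        buf += chunk
        # Keep the trailing partial line in the buffer for the next read.
        *lines, buf = buf.split(b"\n")
        for line in lines:
            if line:
                yield line
```

The processing then becomes a plain loop, for line in read_lines(s): ..., with no second process and no Queue in between.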
Unrelated aside – your method of stripping the start off the buffer is quite inefficient – you don’t need to use replace(); you can just use your existing split(), like this: