I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X is ‘fed’ with some real-time data. I need to write a python script that would get the data as input, load it in-memory, process it, and then save it to disk.
It’s a lot of data, hence, I would ideally want to split the data in memory (using some logic) and just send pieces of it to each individual computer, in the fastest possible way. each individual computer will accept its piece of data, handle it and write it to its local disk.
Suppose I have a container of data in Python (be it a list, a dictionary etc), already processed and split to pieces. What is the fastest way to send each ‘piece’ of data to each individual machine?
You have two (classes of) choices:
In the simplest case, you write a program on each machine in your network that simply listens, processes and writes. You distribute from X to each machine in your pool round-robin. But, you might want to address higher-level concerns like handling node failures or dealing with requests that take longer to process than others, adding new nodes to the system, etc.
As you want more functionality, you’ll probably want to find some existing tool to help you. It sounds like you might want to investigate some combinations of AMQP (for reliable messaging), Hadoop (for distributed data processing) or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than what you could probably build out yourself.