Suppose that you have set up a Cassandra cluster. You've got a 10 TB database that is distributed evenly across 10 nodes, and everything runs smoothly.
Suppose that you also have 100 machines at your disposal, each trying to read (different) data from the Cassandra cluster. In addition, you have many jobs that constantly need to be run, each job at a different time (and, obviously, each job needs to run on a different machine).
How do you manage all these tasks/jobs? How do you distribute the tasks between the machines? How do you keep track of the jobs and machines in the process?
Are there any open-source tools (preferably with a Python client) that help with this in a Linux environment?
What you need is a Grid/HPC Framework to handle your distributed infrastructure and to run jobs.
In Unix/Linux there are two systems that might be of good use to you: the Portable Batch System (PBS) and Condor.
Both Condor and PBS have a master node that acts as the receiver of every job/task. With each job/task you can associate a priority level and discriminators, and the cluster administrator sets up rules based on those discriminators to schedule the jobs.
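To make the idea of priorities and discriminators concrete, here is a toy sketch of what such a master does conceptually. It is not Condor or PBS code: the `ToyMaster` class, the attribute names (`os`, `gpu`) and the matching rule (exact equality on every required attribute) are all assumptions made up for illustration.

```python
import heapq
import itertools


class ToyMaster:
    """Hypothetical stand-in for a scheduler's master node (illustration only)."""

    def __init__(self, machines):
        self.machines = machines        # machine name -> dict of attributes
        self.busy = set()               # machines that already run a job
        self._queue = []                # heap of (-priority, seq, name, requirements)
        self._seq = itertools.count()   # tie-breaker: submission order

    def submit(self, name, priority, requirements=None):
        # Higher priority must pop first, so store the negated priority.
        entry = (-priority, next(self._seq), name, requirements or {})
        heapq.heappush(self._queue, entry)

    def _find_machine(self, requirements):
        # A machine matches if it is free and has every required attribute.
        for machine, attrs in self.machines.items():
            if machine in self.busy:
                continue
            if all(attrs.get(k) == v for k, v in requirements.items()):
                return machine
        return None

    def schedule(self):
        """Assign queued jobs to free machines, highest priority first."""
        assignments, waiting = [], []
        while self._queue:
            entry = heapq.heappop(self._queue)
            _, _, name, requirements = entry
            machine = self._find_machine(requirements)
            if machine is None:
                waiting.append(entry)   # no match now, keep it queued
            else:
                self.busy.add(machine)
                assignments.append((name, machine))
        for entry in waiting:
            heapq.heappush(self._queue, entry)
        return assignments


master = ToyMaster({"node1": {"os": "linux"},
                    "node2": {"os": "linux", "gpu": "yes"}})
master.submit("low", priority=1)
master.submit("gpu-job", priority=5, requirements={"gpu": "yes"})
master.submit("high", priority=10)
result = master.schedule()
```

A real scheduler applies far richer rules (fair share, preemption, resource limits), but the shape is the same: a priority-ordered queue plus a match between job requirements and machine attributes.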
Condor or PBS will do the distribution for you; you only need to submit the job to the master node and specify its priority, inputs, outputs, etc.
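As a rough illustration of what "specifying priority, inputs and outputs" looks like for Condor, the sketch below builds the text of a submit description file from Python. The helper `build_submit_description` and the script name `read_partition.py` are hypothetical; the keys (`executable`, `arguments`, `priority`, `input`, `output`, `error`, `log`, `queue`) are standard submit description commands, and the resulting file would normally be handed to `condor_submit`.

```python
def build_submit_description(executable, arguments="", priority=0,
                             input_file=None, output_file="job.out",
                             error_file="job.err", log_file="job.log"):
    """Return the text of a minimal Condor submit description (a sketch)."""
    lines = [
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"priority = {priority}",      # higher values run earlier among your own jobs
    ]
    if input_file is not None:
        lines.append(f"input = {input_file}")   # file fed to the job's stdin
    lines += [
        f"output = {output_file}",     # job's stdout
        f"error = {error_file}",       # job's stderr
        f"log = {log_file}",           # Condor's event log for this job
        "queue",                       # enqueue one instance of the job
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: read one slice of the Cassandra data.
text = build_submit_description("read_partition.py",
                                arguments="--slice 7",
                                priority=5,
                                output_file="slice7.out")
```

Writing `text` to a file such as `slice7.sub` and running `condor_submit slice7.sub` on a machine that can reach the master is the usual next step; PBS uses a different format (a shell script with `#PBS` directives submitted through `qsub`).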
You can periodically check whether a job has finished, subscribe to notifications via different mechanisms, or do a sort of `job.wait()` to block until it is finished.
Both PBS and Condor have `top`-like commands to list jobs that are queued, running, or cancelled. They also have utilities to stop or cancel a job if the process allows snapshots.
For a large cluster, my advice is to try Condor. It has been around for ages to solve problems exactly like the one you have. Here are some examples for Condor + Python
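Independently of those examples, the "poll until finished" approach mentioned earlier can be sketched in a scheduler-agnostic way. Everything here is an assumption for illustration: `wait_for_job` is a hypothetical helper, `get_status` is a callable you would supply yourself (for instance one that parses the output of the scheduler's queue-listing command), and the status strings are made-up labels rather than real Condor or PBS codes.

```python
import time

# Hypothetical labels for "the job will not change state any more".
FINAL_STATES = ("completed", "failed", "cancelled")


def wait_for_job(get_status, job_id, poll_seconds=30.0, timeout_seconds=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Block until get_status(job_id) reports a final state, then return it."""
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status in FINAL_STATES:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after timeout")
        sleep(poll_seconds)             # avoid hammering the master node


# Usage with a fake status source that reports three successive states.
fake_states = iter(["queued", "running", "completed"])
calls = []


def fake_get_status(job_id):
    calls.append(job_id)
    return next(fake_states)


final = wait_for_job(fake_get_status, "42.0", sleep=lambda seconds: None)
```

Passing `sleep` and `clock` as parameters is only there to keep the sketch easy to try without a cluster; with a real scheduler you would keep the defaults and pick a polling interval that does not overload the master.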
Other more recent solutions to consider are: