I’m writing software that runs a bunch of different programs (via Twisted’s twistd); that is, N daemons of various kinds must be started across multiple machines. If I did this manually, I would be running commands like twistd foo_worker, twistd bar_worker, and so on, on the machines involved.
Basically there will be a list of machines, and the daemon(s) I need them to run. Additionally, I need to shut them all down when the need arises.
If I were to program this from scratch, I would write a “spawner” daemon that would run permanently on each machine in the cluster with the following features accessible through the network for an authenticated administrator client:
- Start a process with a given command line. Return a handle to manage it.
- Kill a process given a handle.
- Optionally, query stats such as CPU time given a handle.
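The three operations above can be sketched roughly as follows. This is only a local, single-machine sketch under some assumptions: the Spawner class and its method names are hypothetical, the network and authentication layers are left out entirely, and the CPU-time query reads /proc, so it is Linux only.

```python
import os
import subprocess
import uuid


class Spawner:
    """Tracks child processes by opaque handle (hypothetical helper)."""

    def __init__(self):
        self._procs = {}

    def start(self, argv):
        """Start a process from an argument list and return a handle."""
        proc = subprocess.Popen(argv)
        handle = uuid.uuid4().hex
        self._procs[handle] = proc
        return handle

    def kill(self, handle, timeout=5.0):
        """Send SIGTERM, escalate to SIGKILL after `timeout` seconds.

        Returns the exit code and forgets the handle.
        """
        proc = self._procs.pop(handle)
        proc.terminate()
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            return proc.wait()

    def cpu_time(self, handle):
        """Return user + system CPU seconds from /proc/<pid>/stat."""
        pid = self._procs[handle].pid
        with open("/proc/%d/stat" % pid) as f:
            # The command name is parenthesised and may contain spaces,
            # so keep only what follows the last ')'.
            fields = f.read().rsplit(")", 1)[1].split()
        # fields[0] is the state (field 3 of the file), so utime and
        # stime (fields 14 and 15) sit at indices 11 and 12.
        ticks = int(fields[11]) + int(fields[12])
        return ticks / os.sysconf("SC_CLK_TCK")
```

A real spawner would also have to reap children that exit on their own and expose these calls over some authenticated protocol.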
It would be fairly trivial to program the above, but I cannot imagine this is a new problem. Surely there are existing solutions to doing exactly this? I do however lack experience with server administration, and don’t even know what the related terms are.
What existing ways are there to do this on a Linux cluster, and what are some of the important terms involved? Python-specific solutions are welcome, but not necessary.
Another way to put it: Given a bunch of machines in a LAN, how do I programmatically work with them as a cluster?
The most familiar and universal way is just to use ssh. To automate, you could use fabric: tasks defined in a fabfile.py can, for example, start foo_worker on all hosts or stop bar_worker on a particular list of hosts.
There are a number of ways to configure host lists in fabric, with scopes varying from global to per-task, and it’s possible to mix and match as needed.
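To make the ssh approach concrete, here is a minimal sketch that drives plain ssh through Python's subprocess module rather than through fabric's own API. It assumes key-based ssh login to each machine and that twistd is on the remote PATH; the helper names and host names are hypothetical, and the pid file is kept in whatever directory the ssh session starts in.

```python
import shlex
import subprocess


def start_command(daemon):
    """Shell command that starts a twistd daemon with its own pid file."""
    name = shlex.quote(daemon)
    return "twistd --pidfile=%s.pid %s" % (name, name)


def stop_command(daemon):
    """Shell command that kills the daemon recorded in its pid file."""
    return "kill $(cat %s.pid)" % shlex.quote(daemon)


def ssh_argv(host, remote_command):
    """Argument list for running one shell command on `host` via ssh."""
    return ["ssh", host, remote_command]


def run_on_hosts(hosts, remote_command):
    """Run the command on each host in turn; return exit codes by host."""
    return {h: subprocess.call(ssh_argv(h, remote_command)) for h in hosts}
```

With hypothetical host names, run_on_hosts(["host1", "host2"], start_command("foo_worker")) would start foo_worker on both machines, and run_on_hosts(["host1"], stop_command("bar_worker")) would stop bar_worker on one of them. Fabric wraps the same idea with host lists, parallel execution and nicer error handling.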
To streamline process management on a particular host, you could write init.d scripts for the daemons (and run service daemon_name start/stop/restart) or use supervisord (and run supervisorctl, e.g., supervisorctl stop all). To control “what is installed where” and to push configuration in a centralized manner, something like puppet could be used.
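If you go the supervisord route, it can also be controlled from Python over its XML-RPC interface, which is close to the "spawner" idea in the question. The sketch below assumes each machine runs supervisord with its inet HTTP server enabled on port 9001 and without a password; the helper names and the port are assumptions to adjust for your setup.

```python
import xmlrpc.client


def rpc_url(host, port=9001):
    """URL of supervisord's XML-RPC endpoint on `host` (port is assumed)."""
    return "http://%s:%d/RPC2" % (host, port)


def stop_all(hosts, port=9001):
    """Ask supervisord on every host to stop all of its processes."""
    results = {}
    for host in hosts:
        proxy = xmlrpc.client.ServerProxy(rpc_url(host, port))
        # Remote equivalent of `supervisorctl stop all`.
        results[host] = proxy.supervisor.stopAllProcesses()
    return results
```

The same proxy exposes calls such as supervisor.startProcess(name) and supervisor.getProcessInfo(name) for starting and inspecting individual daemons.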