I am using a PBS-based cluster and running IPython parallel over a set of nodes, each with either 24 or 32 cores and memory ranging from 24G to 72G; this heterogeneity is due to our cluster having history to it. In addition, I have jobs that I am sending to the IPython cluster that have varying resource requirements (cores and memory). I am looking for a way to submit jobs to the ipython cluster that know about their resource requirements and those of the available engines. I imagine there is a way to deal with this situation gracefully using IPython functionality, but I have not found it. Any suggestions as to how to proceed?
Share
In addition to graph dependencies, which you indicate that you already get, IPython tasks can have functional dependencies. These can be arbitrary functions, like tasks themselves. A functional dependency runs before the real task, and if it returns False or raises a special
parallel.UnmetDependencyexception, the task will not be run on that engine, and will be retried somewhere else.So to use this, you need a function that checks whatever metric you need. For instance, let’s say we only want to run a task on your nodes with a minimum amount of memory. Here is a function that checks the total memory on the system (in bytes):
so
minimum_mem(4 * GB)will return True iff you have at least 4GB of memory on your system. If you want to check available memory instead of total memory, you can use the MemFree and Inactive values in /proc/meminfo to determine what is not already in use.Now you can submit tasks only to engines with sufficient RAM by applying the
@parallel.dependdecorator:Similarly, you can apply restrictions based on the number of CPUs (
multiprocessing.cpu_countis a useful function there).Here is a notebook that uses these to restrict assignment of some dumb tasks.
Typically, the model is to run one IPython engine per core (not per node), but if you have specific multicore tasks, then you may want to use a smaller number (e.g. N/2 or N/4). If your tasks are really big, then you may actually want to restrict it to one engine per node. If you are running more engines per node, then you will want to be a bit careful about running high resource tasks together. As I have written them, these checks do not take into account other tasks on the same node, so if a node as 16 GB of RAM, and you have two tasks that each need 10, you will need to be more careful about how you track available resources.