I’m working on a simple job threading framework which is very similar to the one described in id Tech 5 Challenges. On the most basic level, I have a set of lists of jobs, and I want to schedule these lists across a bunch of CPU threads (using a standard thread pool for the actual dispatching). However, I wonder how this signal/wait stuff inside a wait list can be efficiently implemented. As I understand it, the wait token blocks the list execution if the signal token has not been executed. This implicitly means that everything before a signal has to finish before the signal can be raised. So let’s say we have a list like this:
J1, J2, S, J3, W, J4
then the dispatching can go like this:
#1: J1, J2, J3
<wait for J1, J2, run other lists if possible>
#2: J4
However, this isn’t as easy as it seems: given a set of lists, I would have to move some of them between ready and waiting, and also have special code to gather all jobs before a signal and tag something onto them so that they can trigger the signal if and only if they have all finished (meaning, for instance, that it’s no longer possible to add jobs to the list while it is executing, as subsequent signals access the previously inserted jobs).
Is there any “standard” way of implementing this efficiently? I also wonder how best to schedule the job list execution. Right now, each core grabs a job list and schedules all jobs in it, which gives pretty good scaling (for 32k jobs à 0.7 ms, I get 101%, which I guess is partly due to the fact that the single-threaded version is sometimes scheduled onto different cores).
This is a relatively straightforward scheduling algorithm. A couple of issues seem tricky at first but really aren’t (signal/wait and cache locality). I’ll explain the techniques, then give some code I wrote to illustrate the concepts, then give some final notes on tuning.
Algorithms to use
Handling the signal/wait efficiently seems tricky at first but actually turns out to be extremely easy. Since signal/wait pairs can’t nest or overlap, there can really be only two being satisfied and one being waited on at any given time. Simply keeping a “CurrentSignal” pointer to the most recent unsatisfied signal is all that is necessary to do the bookkeeping.
Making sure that cores don’t jump around between lists too much and that a given list isn’t shared between too many cores is also relatively easy: Each core keeps taking jobs from the same list until it blocks, then switches to another list. To keep all the cores from ganging up on a single list, a WorkerCount is kept for each list that tells how many cores are using it, and the lists are organized so cores select lists with fewer workers first.
Locking can be kept simple by locking only the scheduler or the list you are working on at any time, never both.
You expressed some concern about adding jobs to a list after the list has already started executing. It turns out that supporting this is almost trivial: All it needs is a call from the list to the scheduler when a job is added to a list that is currently completed, so the scheduler can schedule the new job.
Data structures
Here are the basic data structures you’ll need:
Note that the signal points for a given joblist are most conveniently stored separately from the actual list of jobs.
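A minimal C++ sketch of how these structures might be laid out. Only the names CurrentSignal, WorkerCount, IncompleteCount, NextJobIndex, Jobs, Ready and Blocked come from the text; SignalJobIndex, WaitJobIndex, SignalPoints and kNever are hypothetical names chosen for illustration, and CurrentSignal is stored as an index into a vector rather than as a pointer so that it stays valid when the vector grows.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <list>
#include <vector>

using Job = std::function<void()>;

// Hypothetical sentinel meaning "no signal/wait recorded yet".
constexpr std::size_t kNever = std::numeric_limits<std::size_t>::max();

// One signal/wait pair. The last SignalPoint of a list is a dummy that
// collects the jobs added after the most recent signal.
struct SignalPoint {
    std::size_t SignalJobIndex = kNever;  // index of the first job after the signal
    std::size_t WaitJobIndex = kNever;    // index of the first job after the wait
    std::size_t IncompleteCount = 0;      // unfinished jobs counted against this point
};

struct JobList {
    std::vector<Job> Jobs;
    std::size_t NextJobIndex = 0;  // next job to hand out
    std::size_t WorkerCount = 0;   // cores currently working on this list

    // Signal points live apart from the jobs; the list starts with the dummy.
    std::vector<SignalPoint> SignalPoints = std::vector<SignalPoint>(1);
    std::size_t CurrentSignal = 0;  // index of the oldest unsatisfied signal point
};

struct Scheduler {
    explicit Scheduler(std::size_t numCores) : Ready(numCores + 1) {
        assert(numCores > 0);
    }

    // Ready[n] holds the ready lists that currently have n workers.
    std::vector<std::list<JobList*>> Ready;
    // Lists with nothing runnable right now.
    std::list<JobList*> Blocked;
};
```

Indexing Ready by worker count is one way to let a core find a list with few workers by scanning from index 0 upward.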
Scheduler implementation
The scheduler keeps track of job lists, assigns them to cores, and executes jobs from the job lists.
AddList adds a job list to the scheduler. It must be placed on the Ready or Blocked queue depending on whether it has any work to do (i.e. whether any jobs have been added to it yet), so just call UpdateQueues.
UpdateQueues centralizes the queue update logic. Notice the algorithm for selecting a new queue, and also the notification to idle cores when work becomes available:
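The queue update could be sketched roughly as follows. This is an illustration under assumptions, not the definitive implementation: the JobList here is a stub whose HasReadyJob flag stands in for the real Ready property, CurrentQueue and WorkAvailable are hypothetical names, and a list that is not ready is simply parked on Blocked.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <list>
#include <mutex>
#include <vector>

// Stub list: only the fields the scheduler needs for queue selection.
struct JobList {
    bool HasReadyJob = false;                     // stand-in for the Ready property
    std::size_t WorkerCount = 0;                  // cores currently on this list
    std::list<JobList*>* CurrentQueue = nullptr;  // queue the list sits on, if any
};

class Scheduler {
public:
    explicit Scheduler(std::size_t numCores) : Ready(numCores + 1) {}

    // A new list goes through the same logic as any other state change.
    void AddList(JobList& list) { UpdateQueues(list); }

    void UpdateQueues(JobList& list) {
        std::lock_guard<std::mutex> lock(Mutex);

        // Remove the list from whatever queue it was on.
        if (list.CurrentQueue != nullptr) {
            list.CurrentQueue->remove(&list);
            list.CurrentQueue = nullptr;
        }

        // Select the new queue: ready lists are bucketed by worker count,
        // everything else is parked on Blocked.
        if (list.HasReadyJob) {
            std::size_t bucket = std::min(list.WorkerCount, Ready.size() - 1);
            list.CurrentQueue = &Ready[bucket];
        } else {
            list.CurrentQueue = &Blocked;
        }
        list.CurrentQueue->push_back(&list);

        // Wake idle cores when work has become available.
        if (list.HasReadyJob) WorkAvailable.notify_all();
    }

    std::vector<std::list<JobList*>> Ready;
    std::list<JobList*> Blocked;

private:
    std::mutex Mutex;
    std::condition_variable WorkAvailable;
};
```

Idle workers would wait on WorkAvailable under the same mutex; that part is left out here.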
DoWork is a normal scheduler work loop, except that: 1. it selects the JobList with the fewest workers, 2. it works jobs from a given JobList until it can’t any more, and 3. it stores the jobIndex as well as the job so the JobList can easily update completion state (an implementation detail).
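To make those three points concrete, here is a single-threaded sketch of one iteration of such a loop. It is deliberately simplified: the JobList is a stand-in without signals, waits or locking, and Worker, SelectList, DoWorkStep and CompletedCount are hypothetical names. A real worker would run this in a loop under the scheduler lock and sleep when nothing is ready.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Simplified stand-in for the real JobList: no signals, waits or locks.
struct JobList {
    std::vector<std::function<void()>> Jobs;
    std::size_t NextJobIndex = 0;
    std::size_t WorkerCount = 0;
    std::size_t CompletedCount = 0;

    bool HasPendingJob() const { return NextJobIndex < Jobs.size(); }

    bool GetNextReadyJob(std::size_t& jobIndex) {
        if (!HasPendingJob()) return false;
        jobIndex = NextJobIndex++;
        return true;
    }

    void MarkJobCompleted(std::size_t /*jobIndex*/) { ++CompletedCount; }
};

struct Worker {
    JobList* Current = nullptr;  // list this core is currently draining
};

// 1. Pick the list with pending work that has the fewest workers.
JobList* SelectList(const std::vector<JobList*>& lists) {
    JobList* best = nullptr;
    for (JobList* list : lists) {
        if (!list->HasPendingJob()) continue;
        if (best == nullptr || list->WorkerCount < best->WorkerCount) best = list;
    }
    return best;
}

// One iteration of the work loop. Returns false when there is nothing to run.
bool DoWorkStep(Worker& worker, const std::vector<JobList*>& lists) {
    std::size_t jobIndex = 0;

    // 2. Stay on the current list until it has no job to hand out.
    if (worker.Current != nullptr && !worker.Current->GetNextReadyJob(jobIndex)) {
        --worker.Current->WorkerCount;
        worker.Current = nullptr;
    }
    if (worker.Current == nullptr) {
        worker.Current = SelectList(lists);
        if (worker.Current == nullptr) return false;
        ++worker.Current->WorkerCount;
        // Cannot fail here: SelectList only returns lists with a pending job.
        worker.Current->GetNextReadyJob(jobIndex);
    }

    // 3. Keep the index so the list can update its completion state.
    worker.Current->Jobs[jobIndex]();
    worker.Current->MarkJobCompleted(jobIndex);
    return true;
}
```

With a three-job list A and a one-job list B, alternating two workers sends the first to A and the second to B; once B is drained, the second worker joins A.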
JobList implementation
The JobList keeps track of how the signal/wait are interspersed with the jobs and keeps track of which signal/wait pairs have already completed everything before their signal point.
The constructor creates a dummy signal point to add jobs to. This signal point becomes a real signal point (and a new dummy is added) whenever a new “signal” is added.
AddJob adds a job to the list. It is marked as incomplete in the SignalPoint. When the job is actually executed, the IncompleteCount of the same SignalPoint is decremented. It is also necessary to tell the scheduler that things might have changed, since the new job could be immediately executable. Note that the scheduler is called after the lock on “this” is released to avoid deadlock.
AddSignal and AddWait add signals and waits to the job list. Notice that AddSignal actually creates a new SignalPoint, and AddWait just fills in the wait point information in the previously created SignalPoint.
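The constructor, AddJob, AddSignal and AddWait could look roughly like this. It is a sketch under assumptions: NotifyScheduler is a hypothetical callback standing in for the call to the scheduler's UpdateQueues, and SignalJobIndex, WaitJobIndex and kNever are illustrative names. Note how AddJob releases the list lock before calling out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <mutex>
#include <utility>
#include <vector>

constexpr std::size_t kNever = std::numeric_limits<std::size_t>::max();

struct SignalPoint {
    std::size_t SignalJobIndex = kNever;  // first job after the signal
    std::size_t WaitJobIndex = kNever;    // first job after the wait
    std::size_t IncompleteCount = 0;      // unfinished jobs counted here
};

struct JobList {
    using Job = std::function<void()>;

    // The list starts with one dummy signal point that collects new jobs.
    explicit JobList(std::function<void(JobList&)> notifyScheduler)
        : NotifyScheduler(std::move(notifyScheduler)), SignalPoints(1) {}

    void AddJob(Job job) {
        {
            std::lock_guard<std::mutex> lock(Mutex);
            Jobs.push_back(std::move(job));
            ++SignalPoints.back().IncompleteCount;  // counted against the dummy
        }  // list lock released before calling the scheduler, to avoid deadlock
        if (NotifyScheduler) NotifyScheduler(*this);
    }

    void AddSignal() {
        std::lock_guard<std::mutex> lock(Mutex);
        SignalPoints.back().SignalJobIndex = Jobs.size();  // dummy becomes real
        SignalPoints.emplace_back();                       // fresh dummy at the end
    }

    void AddWait() {
        std::lock_guard<std::mutex> lock(Mutex);
        assert(SignalPoints.size() >= 2 && "AddWait needs a preceding AddSignal");
        // Fill in the wait position of the most recently created real signal.
        SignalPoints[SignalPoints.size() - 2].WaitJobIndex = Jobs.size();
    }

    std::function<void(JobList&)> NotifyScheduler;  // hypothetical scheduler hook
    std::vector<Job> Jobs;
    std::vector<SignalPoint> SignalPoints;
    std::mutex Mutex;
};
```

For the list J1, J2, S, J3, W, J4 from the question, this yields one real signal point (signal at index 2, wait at index 3, two incomplete jobs) followed by the dummy, which counts J3 and J4.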
The Ready property determines whether the list is ready for additional cores assigned to it. There may be two or three cores working on the list without the list being “ready” if the next job is waiting for a signal before it can start.
GetNextReadyJob is very simple: If we are ready, just return the next job in the list.
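A sketch of how Ready and GetNextReadyJob might fit together, assuming the list is ready whenever the next job index lies before both the end of the list and the wait position of the current signal. JobCount and CurrentWaitJobIndex are hypothetical stand-ins for Jobs.size() and the wait index stored in CurrentSignal. Because std::mutex is not reentrant, both public functions share an unlocked helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <mutex>

constexpr std::size_t kNever = std::numeric_limits<std::size_t>::max();

class JobList {
public:
    std::size_t JobCount = 0;                  // stands in for Jobs.size()
    std::size_t NextJobIndex = 0;              // next job to hand out
    std::size_t CurrentWaitJobIndex = kNever;  // wait position of CurrentSignal

    // True if another core could take a job from this list right now.
    bool Ready() {
        std::lock_guard<std::mutex> lock(Mutex);
        return ReadyLocked();
    }

    // Hands out the next job index, or returns false if the list is
    // exhausted or the next job sits behind an unsatisfied wait.
    bool GetNextReadyJob(std::size_t& jobIndex) {
        std::lock_guard<std::mutex> lock(Mutex);
        if (!ReadyLocked()) return false;
        jobIndex = NextJobIndex++;
        return true;
    }

private:
    // Caller must hold Mutex.
    bool ReadyLocked() const {
        return NextJobIndex < std::min(JobCount, CurrentWaitJobIndex);
    }

    std::mutex Mutex;
};
```

With four jobs and a wait at index 3, the first three calls succeed and the fourth fails until the wait position is cleared. Cores may still be running those three jobs while the list reports that it is not ready.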
MarkJobCompleted is probably the most interesting of all. Because of the structure of the signals and waits, the current job is either before CurrentSignal or is between CurrentSignal and CurrentSignal.Next (if it is after the last actual signal, it will be counted as being between CurrentSignal and the dummy SignalPoint at the end). We need to reduce the count of incomplete jobs. We may also need to go on to the next signal if this count goes to zero. Of course we never pass the dummy SignalPoint at the end.
Note that this code doesn’t have a call to Scheduler.UpdateQueues because we know the scheduler will be calling GetNextReadyJob in just a second, and if it returns false it will be calling UpdateQueues anyway.
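The completion bookkeeping can be sketched as below, with jobs reduced to a counter and locking omitted to keep the focus on the signal points. The names SignalJobIndex, WaitJobIndex, kNever and JobCount are hypothetical. One choice here is my own assumption rather than something the description requires: CurrentSignal is advanced in a loop, so a following signal point that is already satisfied is skipped too.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

constexpr std::size_t kNever = std::numeric_limits<std::size_t>::max();

struct SignalPoint {
    std::size_t SignalJobIndex = kNever;
    std::size_t WaitJobIndex = kNever;
    std::size_t IncompleteCount = 0;
};

struct JobList {
    std::size_t JobCount = 0;      // stands in for Jobs.size()
    std::size_t NextJobIndex = 0;
    std::vector<SignalPoint> SignalPoints = std::vector<SignalPoint>(1);
    std::size_t CurrentSignal = 0;  // oldest unsatisfied signal point

    void AddJob() { ++JobCount; ++SignalPoints.back().IncompleteCount; }
    void AddSignal() {
        SignalPoints.back().SignalJobIndex = JobCount;
        SignalPoints.emplace_back();
    }
    void AddWait() {
        assert(SignalPoints.size() >= 2);
        SignalPoints[SignalPoints.size() - 2].WaitJobIndex = JobCount;
    }

    bool Ready() const {
        return NextJobIndex <
               std::min(JobCount, SignalPoints[CurrentSignal].WaitJobIndex);
    }

    void MarkJobCompleted(std::size_t jobIndex) {
        // The job is either before CurrentSignal, or between CurrentSignal
        // and the next signal point (possibly the dummy at the end).
        std::size_t owner = jobIndex < SignalPoints[CurrentSignal].SignalJobIndex
                                ? CurrentSignal
                                : CurrentSignal + 1;
        assert(owner < SignalPoints.size());
        assert(SignalPoints[owner].IncompleteCount > 0);
        --SignalPoints[owner].IncompleteCount;

        // Move past satisfied signal points, but never past the dummy.
        while (CurrentSignal + 1 < SignalPoints.size() &&
               SignalPoints[CurrentSignal].IncompleteCount == 0) {
            ++CurrentSignal;
        }
    }
};
```

For J1, J2, S, J3, W, J4: after J1, J2 and J3 have been handed out the list is not ready. Completing J3 alone changes nothing visible, and only when both J1 and J2 have completed does CurrentSignal move to the dummy, which makes J4 ready.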
Tuning based on list length, job length estimates, etc
The code above doesn’t pay any attention to how long the job lists are, so if there are a hundred tiny job lists and one huge one, it is possible for each core to take a separate tiny job list and then all congregate on the huge one, leading to inefficiency. This can be solved by making Ready[] an array of priority queues prioritized on (joblist.Jobs.Count - joblist.NextJobIndex), but with the priority only actually updated in normal UpdateQueues situations for efficiency.
This could get even more sophisticated by creating a heuristic that takes into account the number and spacing of signal/wait combinations to determine the priority. This heuristic would be best tuned by using a distribution of job durations and resource usage.
If individual job durations are known, or if good estimates are available for them, then the heuristic could use the estimated remaining duration instead of just the list length.
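One bucket of such a prioritized Ready[] could be sketched with std::priority_queue. This assumes that lists with more remaining jobs should be picked first, and that the priority is a snapshot taken when the list is queued; QueueEntry, Enqueue, PopLongest and Remaining are hypothetical names.

```cpp
#include <cstddef>
#include <queue>

struct JobList {
    std::size_t JobCount;      // stands in for Jobs.size()
    std::size_t NextJobIndex;  // next job to hand out

    std::size_t Remaining() const { return JobCount - NextJobIndex; }
};

// Entry in one Ready[] bucket. Priority is captured at enqueue time and is
// not refreshed while the entry sits in the queue.
struct QueueEntry {
    std::size_t Priority;
    JobList* List;

    // std::priority_queue pops the largest element first.
    bool operator<(const QueueEntry& other) const {
        return Priority < other.Priority;
    }
};

using ReadyQueue = std::priority_queue<QueueEntry>;

// Called from the normal queue-update path, so the priority is only
// recomputed when the list is re-queued anyway.
void Enqueue(ReadyQueue& queue, JobList& list) {
    queue.push({list.Remaining(), &list});
}

// Returns the list with the most remaining jobs, or nullptr if empty.
JobList* PopLongest(ReadyQueue& queue) {
    if (queue.empty()) return nullptr;
    JobList* list = queue.top().List;
    queue.pop();
    return list;
}
```

If duration estimates are available, Remaining() could return the summed estimate for the jobs not yet handed out, with the rest of the sketch unchanged.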
Final notes
This is a rather standard solution to the problem you present. You can use the algorithms I gave and they will work, including the locking, but you won’t be able to compile the code I wrote above for several reasons:
It is a crazy mix of C++ and C# syntax. I originally started writing in C# then changed a bunch of the syntax to C++ style since I thought that was more likely what you would be using for such a project. But I left in quite a few C#-isms. Fortunately no LINQ ;-).
The LinkedList details have some hand-waving. I assume the list can do First, Last, Add and Remove and that items in the list can do Previous and Next. But I didn’t use the actual API for any real linked list class I know of.
I didn’t compile or test it. I guarantee there is a bug or two in there somewhere.
Bottom line: You should treat the code above as pseudocode even though it looks like the real McCoy.
Enjoy!