I decided I should try to implement coroutines (I think that’s how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.
Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).
You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can’t rewind to “later” on the stack. Only come back earlier.
jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0
What I’d like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other’s execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.
This is how I thought it:
- Function
Acreates and zeroes a parallel stack (allocates memory and all that). - Function
Apushes all its registers to the current stack. - Function
Asets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back. - Function
Azeroes most of its registers and sets the instruction pointer to the beginning of functionB.
That’s for the initialization. Now, the following situation will indefinitely loop:
- Function
Bworks on that stack, does whatever work it needs to. - Function
Bcomes to a point where it needs to interrupt and giveAcontrol again. - Function
Bpushes all of its registers to its stack, takes the mysterious data structureAgave it at the very beginning, and sets the stack pointer and the instruction pointer to whereAtold it to. In the process, it hands backAa new, modified data structure that tells where to resumeB. - Function
Awakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and giveBcontrol again.
All this sounds good to me. However, there is a number of things I’m not exactly at ease with.
- Apparently, on good ol’ x86, there was this
pushainstruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we’ve got a lot more general-purpose registers, and likely several SSE registers. I couldn’t find any evidence thatpushadoes push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all thepushes myself? Moreover, there is nopushfor SSE registers (though there’s bound to be an equivalent—I’m new to this whole “x86 assembler” thing). Is changing the instruction pointer as easy as saying it? Can I do, like,It’s justmov rip, rax(Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do likemov rax, rip(Intel syntax again), willripbe positioned on themovinstruction, to the instruction after it, or somewhere between?jmp foo. Dummy.- I’ve mentioned a mysterious data structure a few times. Up to now I’ve assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
- Did I forget anything?
- While I’d really like to understand how things work, I’m pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like
pthreadfor threads?
Thanks for reading my question textwall.
You are correct in that
PUSHAwont work on x64 it will raise the exception#UD, asPUSHAonly pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.Setting
RIPis simple,jmp raxwill setRIPtoRAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:RAXwill now beb. This works becauseCALLpushes the address of the next instruction. This technique works on IA32 as well (although I’d suppose there’s a nicer way to do it on x64, as it supports RIP-relative addressing, but I don’t know of one). Of course if you make a functioncoroutine_yield, it can just intercept the caller address 🙂Since you can’t push all the registers to the stack in a single instruction, I wouldn’t recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.
Why are you zeroing things in function
A? That’s probably not necessary.Here’s how I would approach the entire thing, trying to make it as simple as possible:
Create a structure
coroutine_statethat holds the following:initargargregisters(also contains the flags)caller_registersCreate a function:
coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);where
coro_funcis a pointer to the coroutine function body.This function does the following:
coroutine_statestructurecsinitargtocs.initarg, these will be the initial argument to the coroutinecoro_functocs.registers.ripcs.registers(not registers, only flags, as we need some sane flags to prevent an apocalypse)cs.registers.rspcoroutine_statestructureNow we have another function:
void* coroutine_next(coroutine_state cs, void* arg)where
csis the structure returned fromcoroutine_initwhich represents a coroutine instance, andargwill be fed into the coroutine as it resumes execution.This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.
cs.caller_registersexcept forRSP, see step 3.argincs.argcs.caller_registers.rsp), adding2*sizeof(void*)will fix it if you’re lucky, you’d have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling itmov rax, [rsp], assignRAXtocs.caller_registers.rip; explanation: unless your compiler is on crack,[RSP]will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)cs.registersjmp cs.registers.rip, efectively resuming execution of the coroutineNote that we never return from this function, the coroutine we jump to “returns” for us (see
coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc’s -fomit-frame_pointer will remove the epilogue stuff.The last function is declared as:
void coroutine_yield(void* ret);This function is called inside the coroutine to “pause” execution of the coroutine and return to the caller of
coroutine_next.in cs.registerscs.registers.rsp), once again, add2*sizeof(void*)to it, and you want this function to be stdcall as wellmov rax, arg(lets just pretend all the functions in your compiler return their arguments inRAX)cs.caller_registersjmp cs.caller_registers.ripThis essentially returns from thecoroutine_nextcall on the coroutine invoker’s stack frame, and since the return value is passed inRAX, we returnedarg. Let’s just say ifargisNULL, then the coroutine has terminated, otherwise it’s an arbitrary data structure.So to recap, you initialize a coroutine using
coroutine_init, then you can repeatedly invoke the instantiated coroutine withcoroutine_next.The coroutine’s function itself is declared:
void my_coro(coroutine_state cs)cs.initargholds the initial function argument (think constructor). Each timemy_corois called,cs.arghas a different argument that was specified bycoroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it callscoroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.Okay, you may now think “thats easy!”, but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally… good luck 🙂