I decided I should try to implement coroutines (I think that’s how I should

Question

0

Asked: May 15, 20262026-05-15T13:37:29+00:00 2026-05-15T13:37:29+00:00

I decided I should try to implement coroutines (I think that’s how I should

0

I decided I should try to implement coroutines (I think that’s how I should call them) for fun and profits. I expect to have to use assembler, and probably some C if I want to make this actually useful for anything.

Bear in mind that this is for educational purposes. Using an already built coroutine library is too easy (and really no fun).

You guys know setjmp and longjmp? They allow you to unwind the stack up to a predefined location, and resumes execution from there. However, it can’t rewind to “later” on the stack. Only come back earlier.

jmpbuf_t checkpoint;
int retval = setjmp(&checkpoint); // returns 0 the first time
/* lots of stuff, lots of calls, ... We're not even in the same frame anymore! */
longjmp(checkpoint, 0xcafebabe); // execution resumes where setjmp is, and now it returns 0xcafebabe instead of 0

What I’d like is a way to run, without threading, two functions on different stacks. (Obviously, only one runs at a time. No threading, I said.) These two functions must be able to resume the other’s execution (and halt their own). Somewhat like if they were longjmping to the other. Once it returns to the other function, it must resume where it left (that is, during or after the call that gave control to the other function), a bit like how longjmp returns to setjmp.

This is how I thought it:

Function A creates and zeroes a parallel stack (allocates memory and all that).
Function A pushes all its registers to the current stack.
Function A sets the stack pointer and the base pointer to that new location, and pushes a mysterious data structure indicating where to jump back and where to set the instruction pointer back.
Function A zeroes most of its registers and sets the instruction pointer to the beginning of function B.

That’s for the initialization. Now, the following situation will indefinitely loop:

Function B works on that stack, does whatever work it needs to.
Function B comes to a point where it needs to interrupt and give A control again.
Function B pushes all of its registers to its stack, takes the mysterious data structure A gave it at the very beginning, and sets the stack pointer and the instruction pointer to where A told it to. In the process, it hands back A a new, modified data structure that tells where to resume B.
Function A wakes up, popping back all the registers it pushed to its stack, and does work until it comes to a point where it needs to interrupt and give B control again.

All this sounds good to me. However, there is a number of things I’m not exactly at ease with.

Apparently, on good ol’ x86, there was this pusha instruction that would send all registers to the stack. However, processor architectures evolve, and now with x86_64 we’ve got a lot more general-purpose registers, and likely several SSE registers. I couldn’t find any evidence that pusha does push them. There are about 40 public registers in a mordern x86 CPU. Do I have to do all the pushes myself? Moreover, there is no push for SSE registers (though there’s bound to be an equivalent—I’m new to this whole “x86 assembler” thing).
Is changing the instruction pointer as easy as saying it? Can I do, like, mov rip, rax (Intel syntax)? Also, getting the value from it must be somewhat special as it constantly changes. If I do like mov rax, rip (Intel syntax again), will rip be positioned on the mov instruction, to the instruction after it, or somewhere between? It’s just jmp foo. Dummy.
I’ve mentioned a mysterious data structure a few times. Up to now I’ve assumed it needs to contain at least three things: the base pointer, the stack pointer and the instruction pointer. Is there anything else?
Did I forget anything?
While I’d really like to understand how things work, I’m pretty sure there are a handful of libraries that do just that. Do you know any? Is there any POSIX- or BSD-defined standard way to do it, like pthread for threads?

Thanks for reading my ~~question~~ textwall.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T13:37:30+00:00

You are correct in that PUSHA wont work on x64 it will raise the exception #UD, as PUSHA only pushes the 16-bit or 32-bit general purpose registers. See the Intel manuals for all the info you ever wanted to know.

Setting RIP is simple, jmp rax will set RIP to RAX. To retrieve RIP, you could either get it at compile time if you already know all the coroutine exit origins, or you could get it at run time, you can make a call to the next address after that call. Like this:

a:
call b
b:
pop rax

RAX will now be b. This works because CALL pushes the address of the next instruction. This technique works on IA32 as well (although I’d suppose there’s a nicer way to do it on x64, as it supports RIP-relative addressing, but I don’t know of one). Of course if you make a function coroutine_yield, it can just intercept the caller address 🙂

Since you can’t push all the registers to the stack in a single instruction, I wouldn’t recommend storing the coroutine state on the stack, as that complicates things anyways. I think the nicest thing to do would be to allocate a data structure for every coroutine instance.

Why are you zeroing things in function A? That’s probably not necessary.

Here’s how I would approach the entire thing, trying to make it as simple as possible:

Create a structure coroutine_state that holds the following:

initarg
arg
registers (also contains the flags)
caller_registers

Create a function:

coroutine_state* coroutine_init(void (*coro_func)(coroutine_state*), void* initarg);

where coro_func is a pointer to the coroutine function body.

This function does the following:

allocate a coroutine_state structure cs
assign initarg to cs.initarg, these will be the initial argument to the coroutine
assign coro_func to cs.registers.rip
copy current flags to cs.registers (not registers, only flags, as we need some sane flags to prevent an apocalypse)
allocate some decent sized area for the coroutine’s stack and assign that to cs.registers.rsp
return the pointer to the allocated coroutine_state structure

Now we have another function:

void* coroutine_next(coroutine_state cs, void* arg)

where cs is the structure returned from coroutine_init which represents a coroutine instance, and arg will be fed into the coroutine as it resumes execution.

This function is called by the coroutine invoker to pass in some new argument to the coroutine and resume it, the return value of this function is an arbitrary data structure returned (yielded) by the coroutine.

store all current flags/registers in cs.caller_registers except for RSP, see step 3.
store the arg in cs.arg
fix the invoker stack pointer (cs.caller_registers.rsp), adding 2*sizeof(void*) will fix it if you’re lucky, you’d have to look this up to confirm it, you probably want this function to be stdcall so no registers are tampered with before calling it
mov rax, [rsp], assign RAX to cs.caller_registers.rip; explanation: unless your compiler is on crack, [RSP] will hold the instruction pointer to the instruction that follows the call instruction that called this function (ie: the return address)
load the flags and registers from cs.registers
jmp cs.registers.rip, efectively resuming execution of the coroutine

Note that we never return from this function, the coroutine we jump to “returns” for us (see coroutine_yield). Also note that inside this function you may run into many complications such as function prologue and epilogue generated by the C compiler, and perhaps register arguments, you have to take care of all this. Like I said, stdcall will save you lots of trouble, I think gcc’s -fomit-frame_pointer will remove the epilogue stuff.

The last function is declared as:

void coroutine_yield(void* ret);

This function is called inside the coroutine to “pause” execution of the coroutine and return to the caller of coroutine_next.

store flags/registers in cs.registers
fix coroutine stack pointer (cs.registers.rsp), once again, add 2*sizeof(void*) to it, and you want this function to be stdcall as well
mov rax, arg (lets just pretend all the functions in your compiler return their arguments in RAX)
load flags/registers from cs.caller_registers
jmp cs.caller_registers.rip This essentially returns from the coroutine_next call on the coroutine invoker’s stack frame, and since the return value is passed in RAX, we returned arg. Let’s just say if arg is NULL, then the coroutine has terminated, otherwise it’s an arbitrary data structure.

So to recap, you initialize a coroutine using coroutine_init, then you can repeatedly invoke the instantiated coroutine with coroutine_next.

The coroutine’s function itself is declared:
void my_coro(coroutine_state cs)

cs.initarg holds the initial function argument (think constructor). Each time my_coro is called, cs.arg has a different argument that was specified by coroutine_next. This is how the coroutine invoker communicates with the coroutine. Finally, every time the coroutine wants to pause itself, it calls coroutine_yield, and passes one argument to it, which is the return value to the coroutine invoker.

Okay, you may now think “thats easy!”, but I left out all the complications of loading the registers and flags in the correct order while still maintaining a non corrupt stack frame and somehow keeping the address of your coroutine data structure (you just overwrote all your registers), in a thread-safe manner. For that part you will need to find out how your compiler works internally… good luck 🙂

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I decided I should try to implement coroutines (I think that’s how I should

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply