I’m currently working my way through Andrew Appel’s Modern Compiler Implementation in Java, and I’m right around the point where I build the low-level intermediate representation.
Initially, I had decided to target the JVM and ignore all of the low-level machine stuff, but in the interest of learning things that I don’t know much about I’ve had a change of heart. This changes my IR, because targeting the JVM allows me to (more or less) wave my hands at making a method call or constructing an object.
The Appel book doesn’t go into detail about any specific machine architecture, so I’d like to know where I can find out everything I need to know to go farther.
The things I’m aware that I need to know are:
- Which instruction set to use. I have two laptops I could develop on; both have Core 2 Duo processors. My current understanding is that x86 processors mostly use the same instruction set, but they are not all exactly the same.
- Whether the operating system affects the code generation step of compilation, or whether it is completely dependent on the processor. For example, I know something is different about generating code to run on a 32-bit vs. a 64-bit platform.
- How stack frames and such are organized. When to use registers vs. putting parameters on the stack, caller-save vs. callee-save, all of that. I’d have thought that this would be described along with the instruction set but so far I haven’t seen this particular info anywhere. Maybe I’m misunderstanding something here?
Links to resources in lieu of answers are perfectly welcomed.
Most of the x86 instruction set is common to all processors, so it’s a reasonably safe bet that both of your processors have the same instruction set. The possible exception is SIMD instructions, which probably won’t be very useful to you when implementing a simple compiler (these instructions are normally used to make multimedia applications and the like go faster). The instruction set is listed in Intel’s manuals: volumes 2A and 2B in particular have a full listing of instructions and their behaviour, although the other volumes are worth taking a look at.
When generating user-space code, the choice of operating system matters when it comes to syscalls. For instance, if you want a program to output something to the terminal on 64-bit Linux, you need to make a system call by:
- Loading the value 1 into rax to indicate this is a write system call.
- Loading the value 1 into rdi to indicate stdout should be used (1 is the file descriptor for stdout).
- Loading the address of the data to write into rsi.
- Loading the number of bytes to write into rdx.
- Executing the syscall instruction once the registers (and memory) have been set up.
The return value from write is stored in rax.
A different operating system might have a different system call number for write, might have a different way of passing in arguments (x86-64 Linux system calls always use rdi, rsi, rdx, r10, r8, and r9 in that order for parameters, with the system call number in rax), and might have different system calls altogether.
The convention for ordinary function calls on Linux is similar: the order of registers is rdi, rsi, rdx, rcx, r8, and r9 (so all the same, except using rcx instead of r10), with further arguments on the stack and a return value in rax. According to this page, registers rbp, rbx, and r12 up to r15 should be preserved across function calls. You are, of course, free to make up your own convention (unless making a system call), but that makes it harder to call or be called from code generated or written by others.
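As a rough illustration of how a code generator might encode the two register conventions described above, here is a minimal Python sketch. The helper names (emit_syscall, assign_call_args), the Intel-style "mov dst, src" output, and the stack[n] slot labels are all assumptions made up for this example, not part of any real assembler or ABI document. It only covers integer and pointer arguments, and it expects each operand to be passed in as an already-formatted string.

```python
# Argument registers in the order given in the answer above.
SYSCALL_ARG_REGS = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]
FUNCTION_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

SYS_WRITE = 1  # x86-64 Linux system call number for write


def emit_syscall(number, args):
    """Return assembly lines for an x86-64 Linux system call.

    `args` is a list of operand strings (immediates or labels); how they
    are spelled depends on the assembler, which is an assumption here.
    """
    if len(args) > len(SYSCALL_ARG_REGS):
        raise ValueError("a system call takes at most 6 register arguments")
    lines = [f"mov rax, {number}"]  # system call number goes in rax
    for reg, arg in zip(SYSCALL_ARG_REGS, args):
        lines.append(f"mov {reg}, {arg}")
    lines.append("syscall")  # result comes back in rax
    return lines


def assign_call_args(count):
    """Map each integer argument index to a register or a stack slot.

    The first six arguments go in registers; the rest get a hypothetical
    "stack[n]" label standing in for the n-th stack-passed argument.
    """
    locations = []
    for i in range(count):
        if i < len(FUNCTION_ARG_REGS):
            locations.append(FUNCTION_ARG_REGS[i])
        else:
            locations.append(f"stack[{i - len(FUNCTION_ARG_REGS)}]")
    return locations


if __name__ == "__main__":
    # write(1, msg, 13): "msg" is a hypothetical label for the buffer.
    for line in emit_syscall(SYS_WRITE, ["1", "msg", "13"]):
        print(line)
    print(assign_call_args(8))
```

Keeping the two register lists as separate tables makes the single difference between them (r10 for system calls, rcx for ordinary calls) explicit, which is an easy detail to get wrong when emitting code by hand.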