After going through some text and source code, I realized that fork, vfork and clone are all executed through do_fork in fork.c, with different parameters.
But how exactly does fork() call do_fork()?
When calling fork(), which functions are called?
What are the step-by-step calls from fork() to do_fork()?
libc’s implementation of fork() and other system calls contains special processor instructions that invoke a system call. System call invocation is architecture-specific, and can be quite a complex topic.
Let’s begin with a “simple” example, MIPS:
On MIPS, system calls are invoked via the SYSCALL instruction. So, libc’s implementation of fork() ends up putting some arguments in some registers and the system call number in register v0, and issuing a syscall instruction.
On MIPS, this causes a SYSCALL_EXCEPTION (exception number 8). When booting, the kernel associates exception 8 with a handling routine in arch/mips/kernel/traps.c:trap_init().
So when the CPU receives an exception 8 because a program has issued a syscall instruction, the CPU transitions into kernel mode and begins executing the handler at handle_sys in /usr/src/linux/arch/mips/kernel/scall*.S (there are several files for the different 32/64-bit kernelspace/userspace combinations). That routine looks up the system call number in the system call table and jumps to the appropriate sys_...() function, in this example sys_fork().
Now, x86 is more complicated. Traditionally, Linux used interrupt 0x80 to invoke system calls. This is associated with an x86 gate in arch/x86/kernel/traps_*.c:trap_init().
An x86 processor has several levels (rings) of privilege (since the 80286). It is only possible to access (jump to) a lower ring (= more privilege) through predefined gates, which are special kinds of segment descriptors set up by the kernel. So, when an int 0x80 is executed, an interrupt is generated, the CPU looks up a special table called the IDT (Interrupt Descriptor Table), sees that it has a gate (a trap gate in x86, an interrupt gate in x86-64), and transitions into ring 0, beginning the execution of the system_call / ia32_syscall handler in arch/x86/kernel/entry_32.S / arch/x86/ia32/ia32entry.S (for x86 and x86_64 respectively).
But, since the Pentium II, there is an alternative way to invoke a system call: using the SYSENTER instruction (AMD also has its own SYSCALL instruction). This is a more efficient way to invoke a system call. The handler for this “newer” mechanism is set in arch/x86/vdso/vdso32-setup.c:syscall32_cpu_init().
That function uses model-specific registers (MSRs) to do the setup. The handler routines are ia32_sysenter_target and ia32_cstar_target (this last one only for x86_64), in arch/x86/kernel/entry_32.S or arch/x86/ia32/ia32entry.S.

Choosing which syscall mechanism to use
The Linux kernel and glibc have a mechanism to choose between the different ways to invoke a system call.
The kernel sets up a virtual shared library for each process, called the VDSO (virtual dynamic shared object), which you can see in the output of cat /proc/<pid>/maps.
This vdso, among other things, contains an appropriate system call invocation sequence for the CPU in use.
In arch/x86/vdso/vdso32/ there are implementations using int 0x80, sysenter and syscall; the kernel selects the appropriate one.
To let userspace know that there is a vdso, and where it is located, the kernel sets AT_SYSINFO and AT_SYSINFO_EHDR entries in the auxiliary vector (auxv, which sits on the initial stack after argc, argv and envp, and is used to pass some information from the kernel to newly started processes). AT_SYSINFO_EHDR points to the ELF header of the vdso; AT_SYSINFO points to the vsyscall implementation.
glibc uses this information to locate the vsyscall. It stores it in the dynamic loader global _dl_sysinfo, and in a field in the header of the TCB (thread control block).
If the kernel is old and doesn’t provide a vdso, glibc provides a default implementation for _dl_sysinfo.
When a program is compiled against glibc, depending on circumstances, a choice is made between different ways of invoking a system call:
- int 0x80 ← the traditional way
- call *%gs:offsetof(tcb_head_t, sysinfo) ← %gs points to the TCB, so this jumps indirectly through the pointer to vsyscall stored in the TCB. This is preferred for objects compiled as PIC. This requires TLS initialization. For dynamic executables, TLS is initialized by ld.so. For static PIE executables, TLS is initialized by __libc_setup_tls().
- call *_dl_sysinfo ← this jumps indirectly through the global variable. This requires relocation of _dl_sysinfo, so it is avoided for objects compiled as PIC.
So, in x86: