
I recently came across a video of Rob Landley's talk, Tutorial: Building the Simplest Possible Linux System. To people who are familiar with OS development this is probably a trivial tutorial, but for me it made the whole point of PID 1 and init systems a lot clearer.

As I understand it, the kernel takes charge of initializing the hardware and exposes a more generic interface to the system (syscalls, etc.). After the kernel has booted, it launches a special process called “init”. This program can be anything we decide it to be.

For a learning experience, I thought it would be interesting to follow the first steps from the talk where he boots a Linux kernel under QEMU and has the init task be just the standard “Hello World” program.

From a high-level perspective, we need to go through the following steps:

  1. Have qemu-system-x86_64 installed
  2. Write a “Hello world” program to act as our init system
  3. Package our init program in a root filesystem
  4. Clone and build a Linux kernel

Finally, we put all the pieces together to boot a simply useless Linux system.

The most interesting part of this exercise is (IMO) creating the init system and root filesystem. The program itself is probably not that interesting, but maybe not entirely what you expect:

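#include <stdio.h>
#include <unistd.h>   /* declares sleep() */

int main(int argc, char *argv[])
{
    printf("Hello, world!\n");
    /* PID 1 must never exit, so park here for a very long time. */
    sleep(999999999);
}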

The init program is not supposed to exit, which is why we have added the sleep(999999999) statement. If PID 1 exits, the kernel panics, so we’ll try to avoid that [1].

We want to build this program as a static binary, and package it up as a root filesystem.

$ gcc -static init.c -o init
$ find . | cpio -o -H newc | gzip > ../root.cpio.gz

The second line collects all the files in the folder containing our init program and puts them in a cpio archive: -o creates the archive and -H selects the archive format. According to the cpio(1) man page, “newc” is a new SVR4 portable format (whatever that means).

Next we want to clone and build a Linux kernel. We’ll just go with a default configuration for x86_64:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ make x86_64_defconfig
$ make -j $(nproc)
…
Kernel: arch/x86/boot/bzImage is ready  (#1)

Now we have all the bits in place to run QEMU. We indicate the kernel we’ve just built and the root filesystem we packaged up before:

$ qemu-system-x86_64 -nographic -no-reboot \
    -kernel ~/sources/linux/arch/x86/boot/bzImage \
    -initrd root.cpio.gz \
    -append "panic=1 console=ttyS0"

This actually boots the Linux kernel, but the following happens when it runs our own init:

[    2.129689] Run /init as init process
[    2.153047] traps: init[1] trap invalid opcode ip:40338d sp:7fff37be1d60 error:0 in init[401000+80000]
[    2.156679] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[    2.157243] CPU: 0 PID: 1 Comm: init Not tainted 5.14.0-rc4+ #2
[    2.157566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
[    2.158028] Call Trace:
[    2.159200]  dump_stack_lvl+0x34/0x44
[    2.159488]  panic+0xf6/0x2b7
[    2.159620]  do_exit.cold+0xc3/0xcf
[    2.159757]  do_group_exit+0x2e/0x90
[    2.159896]  get_signal+0x154/0x850
[    2.160031]  arch_do_signal_or_restart+0xf8/0x710
[    2.160205]  ? force_sig_info_to_task+0xb9/0xf0
[    2.160375]  exit_to_user_mode_prepare+0xc8/0x140
[    2.160550]  ? asm_exc_invalid_op+0xa/0x20
[    2.160698]  irqentry_exit_to_user_mode+0x5/0x10
[    2.160865]  asm_exc_invalid_op+0x12/0x20
[    2.161259] RIP: 0033:0x40338d
[    2.161633] Code: 00 00 48 85 f6 0f 8e 77 02 00 00 3d 07 00 00 80 0f 86 85 01 00 00 b8 08 00 00 80 0f6
[    2.162360] RSP: 002b:00007fff37be1d60 EFLAGS: 00000246
[    2.162587] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    2.162841] RDX: 0000000000000001 RSI: 0000000001000000 RDI: 0000000000000007
[    2.163093] RBP: 0000000000403a50 R08: 0000000000000040 R09: 0000000001000000
[    2.163346] R10: 0000000000000040 R11: 0000000000000010 R12: 0000000000403ae0
[    2.163600] R13: 0000000000000000 R14: 0000000000080000 R15: 0000000000000010
[    2.164624] Kernel Offset: 0x9e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0x)
[    2.165372] Rebooting in 1 seconds..

What happened? It appears that our very simple init program attempted to execute an invalid opcode:

[    2.153047] traps: init[1] trap invalid opcode ip:40338d sp:7fff37be1d60 error:0 in init[401000+80000]

That’s not good.

The program should not require any exotic opcodes by itself, but eventually it dawned on me that the init binary also contains code from glibc (because we built it statically). And as I’m running Gentoo, glibc has been built and installed from source on the host machine, presumably with optimizations for the host CPU, so I’m pretty sure that what we’ve run into is a case where the actual hardware supports opcodes that the emulated CPU doesn’t.

How do we remove glibc as a dependency? We obviously can’t just add -nostdlib to our build command and be done:

$ gcc -static -nostdlib -o init hello.c
/usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000
/usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccwIbTwW.o: in function `main':
hello.c:(.text+0x17): undefined reference to `puts'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0/../../../../x86_64-pc-linux-gnu/bin/ld: hello.c:(.text+0x21): undefined reference to `sleep'
collect2: error: ld returned 1 exit status

So instead we need to replace the call to printf(3) and the call to sleep(3) with something that doesn’t depend on glibc. What it boils down to is that we have to go straight to system calls (see syscalls(2)) and use that interface directly, instead of calling functions provided by our libc implementation.

Let us start with printf(3). I ended up finding this blog post that goes into a lot of great detail (despite terrible formatting) about the syscalls we have to use to replace printf(). In short, we need to use the write(2) syscall. The author goes into detail on figuring out the syscall, writing assembly to execute it, and building a static binary from the assembly and the C code without any libc dependencies.

I found another approach, though. Initially I copied the approach of writing assembly and building a static binary with -nostdlib, but I needed to replace not only write(2) but also sleep(3). The latter, it turns out, is implemented via nanosleep(2), which has a somewhat complicated signature (it accepts structs). Figuring out how to call nanosleep() correctly is an interesting challenge in its own right, but I thought of another idea instead: pause(2).

pause(2) puts the “calling process to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function” [2]. Instead of sleeping for a really long time, let us just have our useless system wait for a signal that never arrives! Invoking pause(2) as a raw syscall should also be relatively simple because it takes no arguments.

So how do we invoke syscalls directly? It turns out that there’s a nolibc.h header file in the Linux tools directory that is “designed to be used as a libc alternative for minimal programs with very limited requirements. It consists of a small number of syscall and type definitions, and the minimal startup code needed to call main()”. The documentation even has a sample gcc invocation:

$ gcc -fno-asynchronous-unwind-tables -fno-ident -s -Os -nostdlib \
        -static -include nolibc.h -o hello hello.c -lgcc

Perfect. If we copy the header file next to our init.c, we should have all the definitions we need for making the right syscalls. The header even has a definition for sys_write() that we can use directly. There’s no sys_pause(), however, so instead we’ll call my_syscall0() with the correct number for pause. For 64-bit x86 platforms that magic number is 34 [3] (initially I was led to believe that it’s 29, but that appears to be for 32-bit x86 and ARM).

Now our init.c source looks as follows:

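/* Syscall number of pause(2) on x86_64. */
#define SYS_PAUSE 34

/* sys_write() and my_syscall0() come from nolibc.h, which is pulled in
 * with -include on the gcc command line. */
int main(int argc, char *argv[])
{
    sys_write(1, "Hello, World!\n", 14);
    /* Wait for a signal that never arrives, so PID 1 never exits. */
    my_syscall0(SYS_PAUSE);
}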

and we build it with:

$ gcc -fno-asynchronous-unwind-tables -fno-ident -s -Os -nostdlib \
        -static -include nolibc.h -o init init.c -lgcc

After packaging the root filesystem again, we run qemu-system-x86_64 again and behold!

…
[    2.077374] Run /init as init process
Hello, World!
[    2.310487] tsc: Refined TSC clocksource calibration: 2903.968 MHz
[    2.310915] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x29dbe785f1c, max_idle_ns: 440795s
[    2.311322] clocksource: Switched to clocksource tsc
[    2.543974] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3

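The kernel boots, runs our useless “Hello world!” init program and then just sits around. To make it stop, run “killall qemu-system-x86_64” from another terminal (or press Ctrl-A followed by X in the QEMU terminal, which works with -nographic).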