Verifying and documenting live-bootstrap

The goal of the live-bootstrap project is to compile the necessary tools to compile Linux from a minimal binary footprint to avoid the possibility that a (binary) compiler could be used to introduce back-doors into the Linux kernel. As a user of the live-bootstrap project, one should be able to trace and review all steps and sources used. The goal of this project is to facilitate this.

The results of this can be found on the following pages:

T-diagram: Zoomable diagram of all the processes for the x86 bootstrap process until tcc-boot0.
x86: Single page with all the information about the sources, processes, and results for the x86 bootstrap process until tcc-boot0.
AMD64: Single page with all the information about the sources, processes, and results for the AMD64 bootstrap process until tcc-boot0.

Background and history

The text is based on a presentation I gave on Hackfest 2024 on Sunday, September 29, 2024 in Dutch. A recording of the presentation on YouTube.

Can we trust Linux?

Linux is an open-source operating system build from available sources. For closed source operating systems, like Windows and MacOS, there is no way to verify that it is reliable.

Malicious compiler

Although the sources of Linux are all known, building the Linux kernel a compiler is needed to compile the C-source files into executable of the target platform.

A compiler is a program which translates a program from an input language to an output language. The input language is often a high-level, human readable language, while the output language is a low-level language, including machine language, which can be executed by the CPU. A compiler, like any other program, is programmed in a language, usually in machine language, or otherwise, depends on some other program based on machine language.

Machine language is usually in a binary form that is not easily readable by humans. This poses a problem for a non-trivial program, such as an optimizing compiler. Although reverse engineering executables is possible, it is a lot of work.

A malicious compiler could insert a back-door in some executables or libraries of the Linux kernel. Such a compiler could recognize when it is compiled by itself and duplicate the malicious code in the resulting compiler, making it malicious as well.

Ken Thompson presented the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust in 1984.

The live-bootstrap approach

The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small binary file. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kerner sources.

For executables, Linux makes use of Executable and Linkable Format (ELF) files. These files start with a header, followed by the program header table, an number of code and/or data segments, and (optionally) a number of section header tables, which include symbol tables with information that can be used by a debugger.

hex0

The small binary is a program called hex0, which converts a file with hexadecimal numbers into a binary file. This program is specific for the target CPU for which Linux needs to be build. These programs can be found in the bootstrap-seeds repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN HERE'. For x86 (32 bits) the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will return hex0, assuming that it is not a malicious variant of hex0. The hex0 program requires two arguments, the names of the input and output files. The function of the program can also be achieved with following command line:

sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file

This makes use of the stream editing program sed and the xxd (hex dump) program that can be used to convert, in both directions, between binary files and hexadecimal files.

Brainfuck

Brainfuck is an esoteric program language with only eight commands, represented by the letters '+-<>[].,'. I wanted to see if it was possible to write a Brainfuck program to replace hex0. I wrote some JavaScript to generate a program from a simple programming language: BF generator. I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did produce a file equal to hex0.

Blog posts:

Parsing kaem files

I decided to look some deeper into live-bootstrap to figure out which programs are executed and which files are being read. There is a global description of all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a minimal shell, called the kaem shell, to execute the script. I wrote a program, kaem_parser.cpp, to parse the kaem files, which generates the HTML page live-bootstrap with all the information.

Blog posts:

Emulator

To even dive deeper, I decided to implement an x86 emulator that implements only the x86 instructions that are actually used. The emulator also implments systems calls and multiple processes. I decided to simply map file operations on file operations on the file system, instead of simulating those in memory.

This allowed me also to verify the code of hex0 and to see if no other binary files were used.

The development of the emulator required some very complicated debuging, but it was also insightfull.

The emulator turned out too be slow to compile GNU Mes.

Blog posts:

strace

With strace command it is possible to trace system calls that are performed during the execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the chroot command) and trace some relevant system calls. I also wrote a program, called scan_trace.cpp to process the produced log file and generate an HTML page listing the processes and source files. It also generates a JSON file with similar information, which is used to generate a T-diagram

GNU Mes

GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset of C and a C compiler written in Scheme. It is a rather complete compiler including a separate linker. I get the idea that it is on par with the Tiny C compiler, which does not need a separate linker, because is not needed for compiling the Tiny C compiler. The MES compiler is rather slow, which is understandable because it is an interpreting runing an interpreting compiler. It is also slow because it compiles many individual C files into object files, which are linked at the end. Everytime all the Scheme files are loaded before the compiler can start its actual work.

The GNU Mes compiler requires a compiler for a minimal subset of C. Because of this the bootstrap does contain include a subset of C compiler and a a partial implementation of the C preprocessor. These are written in another subset of C for which there is compiler written in machine language, one for each supported CPU. I have the impression that the GNU Mes compiler is a bit of overkill for the gap it needs to fill.

On several forums, I have read people raising the question why Forth was not used in the implementation instead of LISP.

I made some investigation and attempts to bridge the gap. To see if it would be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs in M2-Mesoplanet.

Blog posts:

Stack based language

Due to the bugs found in M2-Mesoplanet, I looked into implementing a stack based, Forth like, language that would be close to C and simple to compile. A language that could also serve as an intermediate language. For an example program see test.sl, which is an attempt to implement tcc_cc.c. The file stack_c.cpp contains the compiler for the stack based language.

Blog posts: