Verifying and documenting live-bootstrap
The goal of the live-bootstrap project is to compile the necessary tools to compile Linux
from a minimal binary footprint to avoid the possibility that a (binary)
compiler could be used to introduce back-doors into the Linux kernel. As a user
of the live-bootstrap project, one should be able to trace and review all steps
and sources used. The goal of this project is to facilitate this.
The results of this can be found on the following pages:
- T-diagram: Zoomable diagram of all the
processes for the x86 bootstrap process until tcc-boot0.
- x86: Single page with all the information
about the sources, processes, and results for the x86 bootstrap process
until tcc-boot0.
- AMD64: Single page with all the information
about the sources, processes, and results for the AMD64 bootstrap process
until tcc-boot0.
Background and history
The text is based on a presentation I gave on Hackfest 2024 on
Sunday, September 29, 2024
in Dutch. A
recording of the presentation on YouTube.
Can we trust Linux?
Linux is an open-source
operating system build from available sources. For closed source operating
systems, like Windows and MacOS, there is no way to verify that it is reliable.
Malicious compiler
Although the sources of Linux are all known, building the Linux kernel a
compiler is needed to compile the C-source files into executable of the
target platform.
A compiler is a program which translates a program from an input language to
an output language. The input language is often a high-level, human readable
language, while the output language is a low-level language, including
machine language, which can be executed by the CPU. A compiler, like any
other program, is programmed in a language, usually in machine language, or
otherwise, depends on some other program based on machine language.
Machine language is usually in a binary form that is not easily readable by
humans. This poses a problem for a non-trivial program, such as an optimizing
compiler. Although reverse engineering executables is possible, it is a lot
of work.
A malicious compiler could insert a back-door in some executables or libraries
of the Linux kernel. Such a compiler could recognize when it is compiled by
itself and duplicate the malicious code in the resulting compiler, making it
malicious as well.
Ken Thompson presented
the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust in 1984.
The live-bootstrap approach
The live-bootstrap
approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small binary file. Next the
Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4,
4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last
compiler can be used to compiler the Linux kerner sources.
For executables, Linux makes use of Executable and Linkable Format (ELF) files. These files start with a
header, followed by the program header table, an number of code and/or data
segments, and (optionally) a number of section header tables, which include
symbol tables with information that can be used by a debugger.
hex0
The small binary is a program called hex0, which converts a file with
hexadecimal numbers into a binary file. This program is specific for the
target CPU for which Linux needs to be build. These programs can be found in
the bootstrap-seeds
repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN
HERE'. For x86 (32 bits)
the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will
return hex0, assuming that it is not a malicious variant of hex0. The hex0
program requires two arguments, the names of the input and output files. The
function of the program can also be achieved with following command line:
sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file
This makes use of the stream editing program sed and the xxd
(hex dump) program that can be used to convert, in both directions, between
binary files and hexadecimal files.
Brainfuck
Brainfuck is an esoteric
program language with only eight commands, represented by the letters
'+-<>[].,'. I wanted to see if it was possible to write a Brainfuck
program to replace hex0. I wrote some JavaScript to generate a program from a
simple programming language: BF generator.
I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did
produce a file equal to hex0.
Blog posts:
Parsing kaem files
I decided to look some deeper into live-bootstrap to figure out which programs
are executed and which files are being read. There is a global description of
all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a
minimal shell, called the kaem shell, to execute the script. I wrote a
program, kaem_parser.cpp, to parse
the kaem files, which generates the HTML page live-bootstrap with all the information.
Blog posts:
Emulator
To even dive deeper, I decided to implement an x86 emulator that implements
only the x86 instructions that are actually used. The emulator also implments
systems calls and multiple processes. I decided to simply map file operations
on file operations on the file system, instead of simulating those in memory.
This allowed me also to verify the code of hex0 and to see if no other binary
files were used.
The development of the emulator required some very complicated debuging, but
it was also insightfull.
The emulator turned out too be slow to compile GNU Mes.
Blog posts:
- October 9, 2023
- October 22, 2023
- November 5, 2023
- November 28, 2023
- December 13, 2023
- February 6, 2024
- February 12, 2024
- February 18, 2024
- March 24, 2024
- April 23, 2024
- May 5, 2024
- May 6, 2024
strace
With strace
command it is possible to trace system calls that are performed during the
execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the
chroot
command) and trace some relevant system calls. I also wrote a program,
called scan_trace.cpp to process the produced log file and generate
an HTML page
listing the processes and source files. It also generates
a JSON
file with similar information, which is used to generate
a T-diagram
GNU Mes
GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset
of C and a C compiler written in Scheme. It is a rather complete compiler
including a separate linker. I get the idea that it is on par with the Tiny C
compiler, which does not need a separate linker, because is not needed for
compiling the Tiny C compiler. The MES compiler is rather slow, which is
understandable because it is an interpreting runing an interpreting compiler.
It is also slow because it compiles many individual C files into object files,
which are linked at the end. Everytime all the Scheme files are loaded before
the compiler can start its actual work.
The GNU Mes compiler requires a compiler for a minimal subset of C. Because of
this the bootstrap does contain include a subset of C compiler and a a partial
implementation of the C preprocessor. These are written in another subset of C for which there
is compiler written in machine language, one for each supported CPU. I have the
impression that the GNU Mes compiler is a bit of overkill for the gap it needs
to fill.
On several forums, I have read people raising the question why Forth was not
used in the implementation instead of LISP.
I made some investigation and attempts to bridge the gap. To see if it would
be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs
in M2-Mesoplanet.
Blog posts:
Stack based language
Due to the bugs found in M2-Mesoplanet, I looked into implementing a stack
based, Forth like, language that would be close to C and simple to compile. A
language that could also serve as an intermediate language. For an
example program see test.sl, which is an attempt to implement tcc_cc.c. The file stack_c.cpp contains the compiler for the stack based language.
Blog posts: