The motivation for this project is given in the presentation Reviewing live-bootstrap, which recounts my review of the live-bootstrap project, focusing on its initial phase, known as stage0. Besides the seeds, which were reviewed and verified, the sources, both assembly and C programs, were reviewed as well.
GNU Mes implements a rather complete C compiler in Scheme and uses a Scheme interpreter written in ~5,000 LOC of simple C. To review the GNU Mes compiler, one therefore also has to review the Scheme sources. These sources also contain the 'Simple C' compiler, which covers a substantial subset of C. This raised the question of whether that compiler could be extended far enough to compile the Tiny C Compiler (TCC). The 'Simple C' compiler (one for each target) is written in M1 assembly. For the x86 target this is cc_x86.M1.
For this, I started working on a compiler tailored to compiling only TCC. I started writing it in C to find the minimal subset of syntax and semantics that was required. I made some progress with implementing the C preprocessor. But because of the bootstrapping problem, I came up with the idea of using a stack-based language (called Stack-C) as an intermediate language. I made some attempts to manually compile the compiler to the stack-based language, but that turned out to be rather error-prone, which pushed me to continue working on the C compiler so that it would output Stack-C code. The first milestone came when the C compiler and the Stack-C compiler became self-hosting. But it still proved to be a long way to being able to compile the TCC sources as well, because these sources cover a substantial part of full C. It almost seems as if the TCC sources are their own unit test, with all kinds of edge cases.
Take as an example the hex0.c program, where line 34 starts with an if-statement, which looks like:
if (ch <= ' ')
The C compiler compiles this to Stack-C, resulting in:
ch ?1 32 <=s if {
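To make the postfix notation concrete, here is a minimal sketch in Python of how the condition part of that Stack-C line could be evaluated. It is not part of the project. It assumes that a variable name pushes the address of the variable, that ?1 replaces an address with the (zero-extended) byte stored there, that a number pushes a constant, and that <=s pops two values and pushes 1 or 0 for a signed comparison. The function name eval_condition and the address 28 are hypothetical and only serve the illustration.

```python
def eval_condition(tokens, memory, addresses):
    """Evaluate a postfix token list and return the single value left on the stack.

    The meaning given to each token is an assumption made for this sketch,
    not a specification of Stack-C.
    """
    stack = []
    for tok in tokens:
        if tok in addresses:
            # Variable name: push the address of the variable.
            stack.append(addresses[tok])
        elif tok == "?1":
            # Replace the address on top of the stack by the byte stored there.
            addr = stack.pop()
            stack.append(memory[addr] & 0xFF)
        elif tok == "<=s":
            # Signed comparison: pops right and left, pushes 1 or 0.
            right = stack.pop()
            left = stack.pop()
            stack.append(1 if left <= right else 0)
        else:
            # Anything else is treated as a decimal constant.
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("expression should leave exactly one value")
    return stack[0]


addresses = {"ch": 28}            # hypothetical address of the local ch
condition = "ch ?1 32 <=s".split()

print(eval_condition(condition, {28: ord("\n")}, addresses))  # 1: newline <= ' '
print(eval_condition(condition, {28: ord("A")}, addresses))   # 0: 'A' > ' '
```

The trailing if { of the Stack-C line would then consume this value and skip the block when it is 0.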
The Stack-C line is then compiled (with stack_c, M1 and hex2) into the following fragment in hex0.hex0. The fragment has a three-column format: the first column shows the hexadecimal representation of the assembly instruction in the second column, which in turn is generated from the intermediate-language token shown in the third column:
## hex0.c 34
#:_main_else2 # no else
50 # push_eax # ch (local)
8D85 1C000000 # lea_eax,[ebp+DWORD] %28
8A00 # mov_al,[eax] # ?1
0FB6C0 # movzx_eax,al
50 # push_eax # 32
B8 20000000 # mov_eax, %32
5B # pop_ebx # <=s
39C3 # cmp_eax_ebx
0F9EC0 # setle_al
0FB6C0 # movzx_eax,al
85C0 # test_eax,eax # if
58 # pop_eax
0F84 05000000 # je %_main_else3
(The line with _main_else2 is part of the previous statement.)
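The operands in the first column can be checked by hand. As a minimal sketch, assuming that the % prefix in the second column stands for a 32-bit little-endian value (which matches the bytes shown in the listing), the following Python fragment converts between the decimal operand and its hexadecimal form. The helper names imm32_hex and imm32_value are hypothetical.

```python
import struct


def imm32_hex(value):
    """Return a signed 32-bit value as 8 hex digits in little-endian byte order."""
    return struct.pack("<i", value).hex().upper()


def imm32_value(hex_digits):
    """Decode 8 little-endian hex digits back into a signed 32-bit value."""
    return struct.unpack("<i", bytes.fromhex(hex_digits))[0]


print(imm32_hex(32))            # 20000000, operand of mov_eax, %32
print(imm32_hex(28))            # 1C000000, operand of lea_eax,[ebp+DWORD] %28
print(imm32_value("05000000"))  # 5, operand of je %_main_else3
```

On x86 the operand of this form of je is a displacement relative to the end of the instruction, so the value 5 means that the branch target lies five bytes past the je.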