MES-replacement project

The goal of this project is to simplify stage0 of live-bootstrap by replacing the GNU Mes compiler with a C compiler, written in C, that can compile the Tiny C Compiler version 0.9.26.

The motivation for this project is given in the presentation Reviewing live-bootstrap, which recounts the work I did to review the live-bootstrap project, focusing on its initial phase, known as stage0. Besides reviewing and verifying the seeds, I also reviewed the sources, both assembly and C programs.

The GNU Mes compiler implements a rather complete C compiler in Scheme and uses a Scheme interpreter written in ~5,000 LOC of simple C. To review the GNU Mes compiler, one therefore also has to review the Scheme sources. The stage0 sources also contain the 'Simple C' compiler, which covers a substantial subset of C. This raised the question of whether that compiler could be extended so that it can compile the Tiny C Compiler (TCC). The 'Simple C' compiler (one for each target) is written in M1 assembly. For the x86 target this is cc_x86.M1.

For this, I started working on a compiler tailored to compiling only TCC. I started to do this in C to find the minimal subset of syntax and semantics that was required. I made some progress with implementing the C preprocessor. But due to the bootstrapping problem, I came up with the idea of using a stack-based language (called Stack-C) as an intermediate language. I made some attempts to manually compile the compiler to the stack-based language, but this turned out to be rather error-prone, which pushed me to continue working on the C compiler so that it outputs Stack-C code. The first milestone came when the C compiler and the Stack-C compiler became self-hosted. But it still proved to be a long way to also being able to compile the TCC sources, because these sources cover a substantial part of full C. It almost seems as if the TCC sources are their own unit test, with all kinds of edge cases.

Easy to review

Another goal that I had in mind was that the new approach would be easier to review. This involves making clear the relationship between the C code, the Stack-C code, and the assembly code. In its output, the C compiler places lines that reference the source. Each of these lines starts with a hash ('#') followed by the full name of the source file and a line number (separated by a space). The Stack-C compiler copies these lines verbatim and also puts the Stack-C constants and operators as comments in the generated assembly. The M1 and hex2 programs maintain these as comments.
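To illustrate how such source reference lines could be used by a review tool, here is a minimal sketch in C. The function name parse_source_ref is hypothetical and not part of the project, and the exact layout of the line (a hash, optional spaces, the file name, one space, the line number) is an assumption based on the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from the project: parse a source
 * reference line of the assumed form "# <file name> <line number>".
 * The file name is taken to be everything between the hash (plus any
 * spaces after it) and the last space, so names with spaces still work.
 * Returns 1 on success and 0 if the line is not a source reference. */
int parse_source_ref(const char *line, char *file, size_t file_size,
                     long *line_no)
{
    const char *start;
    const char *last_space;
    char *end;
    long n;
    size_t len;

    if (line[0] != '#')
        return 0;

    /* Skip the hash and any spaces that follow it. */
    start = line + 1;
    while (*start == ' ')
        start++;

    /* The line number is the part after the last space. */
    last_space = strrchr(start, ' ');
    if (last_space == NULL || last_space == start)
        return 0;

    n = strtol(last_space + 1, &end, 10);
    if (end == last_space + 1 || (*end != '\0' && *end != '\n') || n <= 0)
        return 0;

    /* Copy the file name if it fits in the caller's buffer. */
    len = (size_t)(last_space - start);
    if (len + 1 > file_size)
        return 0;
    memcpy(file, start, len);
    file[len] = '\0';

    *line_no = n;
    return 1;
}
```

A tool that walks the Stack-C output could call this on every line and remember the most recent file name and line number, so that each following operator can be traced back to its C source line.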

Take as an example the hex0.c program, where line 34 starts with an if-statement, which looks like:

       if (ch <= ' ')
The C compiler compiles this to Stack-C, resulting in:
     ch ?1 32 <=s if {
This is then compiled (with stack_c, M1 and hex2) into the following fragment in hex0.hex0, which has a three-column format: the first column shows the hexadecimal representation of the assembly instruction in the second column, which is generated from the intermediate language shown in the third column:
                             ## hex0.c 34
                             #:_main_else2 # no else
50                           #  push_eax              # ch (local)
8D85 1C000000                #  lea_eax,[ebp+DWORD] %28
8A00                         #  mov_al,[eax]          # ?1
0FB6C0                       #  movzx_eax,al
50                           #  push_eax              # 32
B8 20000000                  #  mov_eax, %32
5B                           #  pop_ebx               # <=s
39C3                         #  cmp_eax_ebx
0F9EC0                       #  setle_al
0FB6C0                       #  movzx_eax,al
85C0                         #  test_eax,eax          # if
58                           #  pop_eax
0F84 05000000                #  je %_main_else3
(The line with _main_else2 is part of the previous statement.)
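The meaning of the Stack-C fragment can be made concrete with a small sketch in C that evaluates `ch ?1 32 <=s` on an explicit stack. This is not the stack_c implementation. The operator semantics are inferred from the assembly shown above (`?1` loads one byte and zero-extends it, `<=s` is a signed less-or-equal), and all names in the code are hypothetical.

```c
#include <stdint.h>

/* A tiny evaluation stack; the size of 16 is an arbitrary choice. */
typedef struct {
    intptr_t items[16];
    int top;
} Stack;

static void push(Stack *s, intptr_t v) { s->items[s->top++] = v; }
static intptr_t pop(Stack *s) { return s->items[--s->top]; }

/* ?1 : replace the address on top of the stack by the byte stored
 * there, zero-extended (compare mov_al,[eax] and movzx_eax,al). */
static void op_load1(Stack *s)
{
    unsigned char *address = (unsigned char *)pop(s);
    push(s, *address);
}

/* <=s : pop right and left operand, push 1 if left <= right (signed),
 * otherwise 0 (compare cmp, setle_al and movzx_eax,al). */
static void op_le_signed(Stack *s)
{
    intptr_t right = pop(s);
    intptr_t left = pop(s);
    push(s, left <= right);
}

/* Evaluates the condition of `if (ch <= ' ')` the way the Stack-C
 * line `ch ?1 32 <=s if {` describes it, and returns the value that
 * the `if` would test. */
int eval_ch_le_space(unsigned char ch)
{
    Stack s = { {0}, 0 };

    push(&s, (intptr_t)&ch); /* ch  : address of the local variable */
    op_load1(&s);            /* ?1  : load one byte from that address */
    push(&s, 32);            /* 32  : the constant for ' ' */
    op_le_signed(&s);        /* <=s : signed comparison */
    return (int)pop(&s);     /* if  : zero means jump to the else label */
}
```

Calling eval_ch_le_space with a space or a newline yields 1, and with a letter such as 'A' it yields 0, which corresponds to the je instruction skipping the body of the if-statement.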

Status

The first four tasks of the project have been completed, resulting in a working replacement for the x86 target.

Results

