This is a page for a workshop at MCH2022
about parsing. It serves as both an interactive page to be used during the
workshop and as an introduction to parsing. Use the show buttons to reveal
the explanatory texts.
What is 3 + 4?
(type '=' in the input field above to see the answer.)
Most of us, when we read the text '3 + 4', will immediately think
of seven as the answer. To arrive at this answer, we read the symbol
'3' and realize that it stands for the number three. Likewise, we read four
for the symbol '4'. We have also been taught that the symbol '+' stands for
addition, and in our head we add the numbers three and four to arrive at seven.
There are some more implicit rules that we employ. We read the text
'3 3' as twice the number three, whereas we read '33' as thirty-three. So a
space does matter when placed between digits. On the other hand, we read
'3+4' the same as '3 + 4'.
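These lexical rules can be sketched with a tiny tokenizer (a Python illustration, not the code used on this page): a token is either a run of digits or a single non-space character, and whitespace only serves to separate tokens.

```python
import re

def tokenize(text):
    # A run of digits forms one token; any other non-space
    # character is a token on its own.
    return re.findall(r'\d+|\S', text)

print(tokenize('3 + 4'))  # ['3', '+', '4']
print(tokenize('3+4'))    # ['3', '+', '4']  (same as with spaces)
print(tokenize('3 3'))    # ['3', '3']       (two numbers)
print(tokenize('33'))     # ['33']           (one number)
```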
What is 3 + 2 * 2 - 1? 9 or 6?
If you enter 3 + 2 * 2 - 1 in a pocket calculator, the answer will probably be
9: adding three and two gives five, multiplying this by two gives ten, and
subtracting one from ten gives nine.
However, if you apply the rule that multiplication (and division) goes before
addition and subtraction, you start by multiplying two by two, giving four.
Adding four to three gives seven, and subtracting one from seven gives six.
In the first case you assume that all operators have the same priority; in the
second you assume that some operators have a higher priority than others.
You can force one of the two interpretations by using round brackets:
(3 + 2) * 2 - 1 will return nine
3 + (2 * 2) - 1 will return six
Try entering these expressions in the input field of the previous section to see
which operator priorities are implemented in this expression evaluator.
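The two readings can be illustrated in code (a Python sketch, not this page's evaluator): one function treats all operators as having the same priority, while Python's own arithmetic gives '*' a higher priority.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def eval_left_to_right(tokens):
    """Pocket-calculator style: all operators have the same priority."""
    value = float(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        value = OPS[op](value, float(num))
    return value

print(eval_left_to_right("3 + 2 * 2 - 1".split()))  # 9.0
print(3 + 2 * 2 - 1)  # 6, because Python gives '*' a higher priority
```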
What is 12 / 2/2? 12 or 3?
2/2 is one, and twelve divided by one is twelve. But 12 divided by 2 is six, and
six divided again by 2 is three. Assuming that spaces have no semantic meaning,
the question here is whether the divide operator is left or right
associative. Depending on this, the brackets are thought to be placed as:
Left associative: (12 / 2) / 2
Right associative: 12 / (2 / 2)
Note that for 3 * 2 * 5 it does not matter whether we place the brackets
like (3 * 2) * 5 or 3 * (2 * 5).
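The two bracketings correspond to folding the operands from the left or from the right, as this Python sketch shows:

```python
from functools import reduce

nums = [12, 2, 2]

# Left associative: (12 / 2) / 2
left = reduce(lambda a, b: a / b, nums)
print(left)   # 3.0

# Right associative: 12 / (2 / 2), i.e. fold from the right
right = reduce(lambda acc, a: a / acc, reversed(nums))
print(right)  # 12.0
```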
Abstract syntax tree
An alternative way to represent expressions without using brackets is to use
a tree. (By convention these trees are drawn upside down.) Below, such a tree is
shown for the given expression. Experiment with changing the expression. If
the expression is not correct, three question marks are displayed.
Such a tree is also called an abstract syntax tree, because it abstracts from
the concrete syntax of the characters. From the syntax tree, one cannot tell
whether brackets were used in the input, except when they make a difference in
how the expression should be calculated.
To calculate the value of the
syntax tree below, find a node that has only numbers below it and replace
that node by the result of applying its operation to those numbers. Repeat this
until no such node remains.
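The bottom-up replacement procedure corresponds to a simple recursive evaluation. A Python sketch (the tuple representation is just an illustration, not how this page stores its trees):

```python
import operator

OPS = {'add': operator.add, 'sub': operator.sub,
       'mul': operator.mul, 'div': operator.truediv}

def evaluate(node):
    """A leaf is a number; an inner node is (operation, left, right)."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    # Reducing the children first is the same as repeatedly replacing a
    # node that has only numbers below it by the result of its operation.
    return OPS[op](evaluate(left), evaluate(right))

# The tree for 3 + (2 * 2) - 1:
tree = ('sub', ('add', 3, ('mul', 2, 2)), 1)
print(evaluate(tree))  # 6
```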
Railroad diagrams or production rules
There are various ways to specify a grammar for how an input text needs to be
transformed into an abstract syntax tree. Some of these are:
The idea of railroad diagrams is that you represent the grammar by a number of
boxes and lines with switches that connect the boxes. The boxes can contain
characters or the names of other railroad diagrams. An example of railroad
diagrams for arithmetic expressions can be found in
Parsing a Simplified Grammar. (This page also contains some production
rules, but these production rules are incorrect.)
Railroad diagrams are nice to explain a grammar, but if you want to use them
for actually parsing some input you need some tool that can interpret them.
In the rest of this workshop we are going to work with grammars that are
specified with production rules.
Terminals
Terminals in a formal language are like the lexical elements of a natural
language. In a natural language we have words, names, numbers, and punctuation
marks. In a programming language, we have integers, floating point numbers,
identifiers, strings (a sequence of characters between quotation marks),
keywords, operators, and such.
Below, a production rule is given for parsing four terminals. On the left-hand
side of the colon, the word 'root' is given. This is the starting
point of the grammar. On the right-hand side of the colon, the names of four
terminals are given. The word 'int' denotes an integer
number, the word 'ident' denotes an identifier, the
word 'string' denotes a string between double
quotes, and the word 'char' denotes a single
character placed between single quotes. A period is used to mark the end of
the production rule.
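Written out in the notation described above, such a rule would read (a reconstruction from the text, not necessarily exactly what the interactive widget shows):

```
root : int ident string char .
```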
Literals and keywords
Two other types of terminals in programming languages are literals and
keywords. In a grammar we can define these by placing some text between
double quotes. If the text consists of only alphabetic characters, it is taken
as a keyword. (Sometimes keywords are also called reserved words.) A keyword
usually also matches the pattern of an identifier. It is thus logical to exclude
keywords from the identifiers. This can be tested in the input by replacing the
a with if. For literals the rule is that the characters must appear
directly after each other, without spaces between them. This can be tested in
the input by inserting a space between the two colons.
Non-terminals
Below, a grammar consisting of three rules is shown that parses a very simple
expression. In this grammar the text expr is a non-terminal.
(Actually, the text root is also a non-terminal: a special one that
denotes the start of the grammar.) The non-terminal expr is used in
the first rule to represent the expression that is followed by an equal sign.
The last two rules specify that the expression may consist of either an
integer followed by a plus sign and another integer, or a single integer.
In our grammar definition, the order of the rules does matter. The first
grammar rule of a non-terminal that parses a text takes precedence over
following rules. If one swaps the last two rules in the example below, the
input will no longer be parsed.
Note that in the output the plus sign is not shown.
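Following the description above, the three rules would read something like this (a reconstruction, not necessarily the exact grammar shown in the widget):

```
root : expr "=" .
expr : int "+" int .
expr : int .
```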
Add and subtract
In the grammar below, a grammar rule with a minus sign is included for
subtraction. Replace the plus sign in the input with a minus sign to
see the effect in the output. Note that the resulting output is the same.
See the next section for how to distinguish between them.
Syntax tree labeling
To distinguish between these two rules in the abstract syntax tree, we
have added the words add and sub between square brackets at
the end of the rules. Replace the plus sign in the input with a minus sign to
see the effect in the output.
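With the labels added, the rules from the description would read roughly:

```
root : expr "=" .
expr : int "+" int [add] .
expr : int "-" int [sub] .
```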
Left and right recursion
The above grammar does not allow us to parse additions of more than two
numbers. In the grammar below, the grammar rules have been slightly modified.
Note that in the second and third rule the first terminal int has been
replaced by the non-terminal expr. Rules where the non-terminal
appears as the first element are called left recursive. If we had replaced the
second terminal int
in both grammar rules with expr instead, we would have gotten right
recursion. Try this yourself by modifying the grammar.
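As a sketch of the difference (reconstructed from the description):

```
Left recursive:   expr : expr "+" int [add] .
Right recursive:  expr : int "+" expr [add] .
```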
Bar symbol
As a shorthand it is possible to use the bar symbol ('|') to combine the
various rules for one non-terminal into a single rule. This is shown in the
grammar below, which is the grammar that was used at the start for arithmetic
expressions.
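For example, the labeled addition and subtraction rules of the previous sections could be combined as follows (an illustration of the notation only; the full arithmetic grammar also covers multiplication, division, and brackets):

```
expr : expr "+" int [add] | expr "-" int [sub] | int .
```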
Extensions: SEQ, OPT, CHAIN, and LIST
To avoid having to introduce many non-terminals for common language constructs
such as sequences and optional elements, a number of extensions are introduced.
First of all, round brackets can be used to make subrules. The keyword
SEQ can be used to express that the element (or subrule) before it can
occur one or more times. The keyword OPT can be used to indicate that
the element is optional. The keyword CHAIN can be used to define a
chain of one or more elements separated by the literal that follows the
keyword. The keyword LIST is equivalent to a chain of elements separated by
commas. The OPT keyword can be combined with the other keywords.
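As hypothetical illustrations of the notation (these exact rules do not appear on this page):

```
statements : statement SEQ OPT .              (zero or more statements)
arguments  : "(" ( expr CHAIN "," ) OPT ")" . (optional comma-separated chain)
elements   : "[" expr LIST "]" .              (expr LIST equals expr CHAIN ",")
```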
'Natural' programming languages. Just like natural languages have things
in common and follow certain universal patterns, the same is true for
programming languages. So, let's
focus on those common patterns. For example, if a programming language has
reserved words, these are usually reserved everywhere and not restricted to
certain contexts.
Just interpret the grammar. Let's not process the grammar, but see if we
can write an interpreter that tries to parse the input following the grammar
rules. That means a back-tracking recursive descent parser. Yes, bottom-up
parsers, such as the LR
parsers invented by Donald Knuth, can parse input in linear time, but they
require a costly compilation step that often generates large tables. It is a
common experience that writing a grammar for yacc (or bison) is a lot of work,
especially when having to deal with
shift-reduce conflicts.
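The core of such a back-tracking recursive descent interpreter fits in a few lines. This Python sketch is illustrative only (the grammar representation and names are mine, not IParse's); note that the expr rule is deliberately right recursive, since naive recursive descent loops forever on left recursion:

```python
import re

# Each non-terminal maps to a list of alternatives; an alternative is a
# list of symbols: a non-terminal, the terminal 'int', or a literal.
GRAMMAR = {
    'root': [['expr']],
    'expr': [['int', '+', 'expr'],   # right recursive on purpose
             ['int']],
}

def parse(nonterm, text, pos):
    """Try the alternatives in order; return the new position or None."""
    for alternative in GRAMMAR[nonterm]:
        p = pos
        for symbol in alternative:
            p = match(symbol, text, p)
            if p is None:
                break            # back-track: try the next alternative
        else:
            return p
    return None

def match(symbol, text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1                 # skip white space before each symbol
    if symbol in GRAMMAR:
        return parse(symbol, text, pos)
    if symbol == 'int':
        m = re.match(r'\d+', text[pos:])
        return pos + m.end() if m else None
    if text.startswith(symbol, pos):
        return pos + len(symbol)
    return None

def accepts(text):
    p = parse('root', text, 0)
    return p is not None and text[p:].strip() == ''

print(accepts('1 + 2 + 3'))  # True
print(accepts('1 +'))        # False
```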
Memory is not an issue. Nowadays it is no problem to load a source
file into memory. Let's just use caching of intermediate results: remember at
every location in the file which non-terminals parse there and which do not.
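The caching idea can be sketched by memoizing the parse function on (non-terminal, position), packrat-style, so every combination is tried at most once. Again an illustrative Python sketch, not IParse's code:

```python
from functools import lru_cache

GRAMMAR = {
    'expr': (('int', '+', 'expr'), ('int',)),
}
TEXT = '1+2+3'

@lru_cache(maxsize=None)          # cache keyed on (non-terminal, position)
def parse(nonterm, pos):
    for alternative in GRAMMAR[nonterm]:
        p = pos
        for symbol in alternative:
            if symbol in GRAMMAR:
                p = parse(symbol, p)
            elif symbol == 'int':
                p = p + 1 if p < len(TEXT) and TEXT[p].isdigit() else None
            else:
                p = p + 1 if TEXT.startswith(symbol, p) else None
            if p is None:
                break             # this alternative failed; try the next
        else:
            return p
    return None

print(parse('expr', 0))  # 5: the whole of '1+2+3' is parsed
```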
Compiled is not always faster. I made an attempt to modify the interpreter
into a compiler that would generate C code (by simply applying partial
evaluation). Compiling this code took a long time, and the result also
executed more slowly. Why? My guess is that it caused more CPU cache misses.
Hard-coded scanner. Let's just start with a basic set of common terminals
used in programming languages and use a hard-coded scanner.
Scanner called by parser. It makes more sense to let the grammar
dictate which terminals are to be accepted at a given location than to use a
first pass that breaks the input down into terminals, usually called
tokenizing. Experience has shown that this is often not possible without
introducing states into this first pass, states that are a reflection of the
grammar. If we had a terminal for Roman numerals, it would be possible to
write a grammar for which it is very
hard to write a tokenizer that knows when an identifier should be read as a
Roman numeral and when simply as an identifier. Below, an example is given,
where
roman_numeral is a terminal accepting a Roman numeral.
IParse Studio is a good tool for prototyping a grammar, as it allows
you to quickly test certain grammar ideas. IParse tells you what it was
expecting at the first location where an error occurs. It has no error
recovery.
IParse is also suitable for implementing an embedded scripting language with
an interpreter where only short scripts are processed and error recovery
is not important.
Using a generic syntax tree is a disadvantage, because you need to write
code to process it. It is also not easy to attach specific data (think symbol
tables) to certain nodes of a generic tree. For more advanced analyses you
would probably need to transform it into a more specific data structure.
Production grade compilers often use hand-coded parsers with smart
error recovery.
An intermediate solution is to use an LALR parser generator
such as Yacc/Bison in combination with a scanner generator such as Lex/Flex.
It is nice to already have a tested grammar before starting to use a parser
generator. Yacc and Bison do not support some of the extensions that IParse
supports, which will require you to rewrite the grammar rules into multiple
rules.
Single file C program. This was the version it all started with: the
realization that, with some caching, you can implement a reasonably fast
parser without having to do many complicated things. This version also
includes an attempt to deal with context scoping rules and identifier
matching. However, it was a bit too simplistic.
Java implementation. A quick reimplementation in Java, probably not
how you should do it in Java. It was not developed further.
C++ implementation. This implementation in C++ grew through the years.
It contains various implementations of the parser, several kinds of scanners,
support for code pages and Unicode, some more grammar extensions, the
introduction of 'white-space terminals', and the implementation of an unparser.
It is used in production as the parser for the
BiZZdesign scripting language. It was also used for implementing the parsing
and unparsing of Windows resource (.rc) files to aid language
translations.
RawParser. Consists of a single C file, which shows by example how a
grammar-driven, scannerless
parser can be implemented. It was also intended to be educational, with
extensive comments and an incremental exposition with examples and unit tests.
However, I found the requirements that C puts on the order of statements a bit
restrictive.
Literate programming with MarkDown. I came up with the idea for a form
of literate programming using MarkDown files. Traditional literate programming
works by introducing identifiers for fragments of code and uses substitution to
put all the fragments in the right order. Instead, this program parses the
fragments and automatically puts them in the right
order. It also introduces some syntax to extend earlier defined structures and
functions. I started using it for the documentation of RawParser.
JavaScript implementation. Last year, for a workshop similar to this one, I
made a very basic JavaScript implementation embedded in an HTML file. For this
workshop I placed it in a single JavaScript file.
Resources
AGL Editor Demo:
A similar approach, with a more advanced editor (with syntax colouring)