This is a page for a workshop at MCH2022
about parsing. It serves as both an interactive page to be used during the
workshop and as an introduction to parsing. Use the show buttons to reveal
the explanatory texts.
What is 3 + 4?
(type '=' in the input field above to see the answer.)
Most of us, when we read the text '3 + 4', will immediately think
of seven as the answer. To arrive at this answer, we read the symbol
'3' and realize that it stands for the number three. Likewise, we read four
for the symbol '4'. We have also been taught that the symbol '+' stands for
addition, and in our head we add the numbers three and four to arrive at seven.
There are some more implicit rules that we employ. We read the text
'3 3' as twice the number three, whereas we read '33' as thirty-three. So a
space does matter when placed between digits. On the other hand, we read
'3+4' the same as '3 + 4'.
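These lexical rules can be sketched with a tiny tokenizer (a Python illustration, not the code used on this page): a token is either a run of digits or a single non-space character, and whitespace only serves to separate tokens.

```python
import re

def tokenize(text):
    # A run of digits forms one token; any other non-space
    # character is a token on its own.
    return re.findall(r'\d+|\S', text)

print(tokenize('3 + 4'))  # ['3', '+', '4']
print(tokenize('3+4'))    # ['3', '+', '4']  (same as with spaces)
print(tokenize('3 3'))    # ['3', '3']       (two numbers)
print(tokenize('33'))     # ['33']           (one number)
```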
What is 3 + 2 * 2 - 1? 9 or 6?
If you enter 3 + 2 * 2 - 1 in a pocket calculator, the answer will probably be
9: adding three and two gives five, multiplying this by two gives ten, and
subtracting one from ten gives nine.
However, if you apply the rule that multiplication (and division) goes before
addition and subtraction, you start by multiplying two by two, giving four.
Adding four to three gives seven, and subtracting one from seven gives six.
In the first case you assume that all operators have the same priority; in the
second you assume that some operators have a higher priority than others.
You can force one of the two interpretations by using round brackets:
(3 + 2) * 2 - 1 will return nine
3 + (2 * 2) - 1 will return six
Try entering these expressions in the input field of the previous section to see
which operator priorities are implemented in this expression evaluator.
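The two readings can be illustrated in code (a Python sketch, not this page's evaluator): one function treats all operators as having the same priority, while Python's own arithmetic gives '*' a higher priority.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def eval_left_to_right(tokens):
    """Pocket-calculator style: all operators have the same priority."""
    value = float(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        value = OPS[op](value, float(num))
    return value

print(eval_left_to_right("3 + 2 * 2 - 1".split()))  # 9.0
print(3 + 2 * 2 - 1)  # 6, because Python gives '*' a higher priority
```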
What is 12 / 2/2? 12 or 3?
2/2 is one, and twelve divided by one is twelve. But 12 divided by 2 is six, and
six divided again by 2 is three. Assuming that spaces have no semantic meaning,
the question here is whether the divide operator is left or right
associative. Depending on this, the brackets are thought to be placed as:
Left associative: (12 / 2) / 2
Right associative: 12 / (2 / 2)
Note that for 3 * 2 * 5 it does not matter whether we place the brackets
like (3 * 2) * 5 or 3 * (2 * 5).
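The two bracketings correspond to folding the operands from the left or from the right, as this Python sketch shows:

```python
from functools import reduce

nums = [12, 2, 2]

# Left associative: (12 / 2) / 2
left = reduce(lambda a, b: a / b, nums)
print(left)   # 3.0

# Right associative: 12 / (2 / 2), i.e. fold from the right
right = reduce(lambda acc, a: a / acc, reversed(nums))
print(right)  # 12.0
```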
Abstract syntax tree
An alternative way to represent expressions without using brackets is to use
a tree. (By convention these trees are drawn upside down.) Below, such a tree is
shown for the given expression. Experiment with changing the expression. If
the expression is not correct, three question marks are displayed.
Such a tree is also called an abstract syntax tree, because it abstracts from
the concrete syntax of the characters. From the syntax tree, one cannot tell
whether brackets were used in the input, except when they make a difference in
how the expression should be calculated.
To calculate the value of the
syntax tree below, find a node that has only numbers below it and replace
that node by the result of applying its operation to those numbers. Repeat this
until no such node remains.
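The bottom-up replacement procedure corresponds to a simple recursive evaluation. A Python sketch (the tuple representation is just an illustration, not how this page stores its trees):

```python
import operator

OPS = {'add': operator.add, 'sub': operator.sub,
       'mul': operator.mul, 'div': operator.truediv}

def evaluate(node):
    """A leaf is a number; an inner node is (operation, left, right)."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    # Reducing the children first is the same as repeatedly replacing a
    # node that has only numbers below it by the result of its operation.
    return OPS[op](evaluate(left), evaluate(right))

# The tree for 3 + (2 * 2) - 1:
tree = ('sub', ('add', 3, ('mul', 2, 2)), 1)
print(evaluate(tree))  # 6
```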
Railroad diagrams or production rules
There are various ways to specify a grammar for how an input text needs to be
transformed into an abstract syntax tree. Some of these are:
The idea of railroad diagrams is that you represent the grammar by a number of
boxes and lines with switches that connect the boxes. The boxes can contain
characters or the names of other railroad diagrams. An example of railroad
diagrams for arithmetic expressions can be found in
Parsing a Simplified Grammar. (This page also contains some production
rules, but these production rules are incorrect.)
Railroad diagrams are nice to explain a grammar, but if you want to use them
for actually parsing some input you need some tool that can interpret them.
In the rest of this workshop we are going to work with grammars that are
specified with production rules.
Terminals
Terminals in a formal language are like the lexical elements of a natural
language. In a natural language we have words, names, numbers, and punctuation
marks. In a programming language, we have integers, floating point numbers,
identifiers, strings (a sequence of characters between quotation marks),
keywords, operators, and such.
Below, a production rule is given for parsing four terminals. On the left-hand
side of the colon, the word 'root' is given. This is the starting
point of the grammar. On the right-hand side of the colon, the names of four
terminals are given. The word 'int' denotes an integer
number, the word 'ident' denotes an identifier, the
word 'string' denotes a string between double
quotes, and the word 'char' denotes a single
character placed between single quotes. A period is used to mark the end of
the production rule.
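Written out in the notation described above, such a rule would read (a reconstruction from the text, not necessarily exactly what the interactive widget shows):

```
root : int ident string char .
```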
Literals and keywords
Two other types of terminals in programming languages are literals and
keywords. In a grammar we can define these by placing some text between
double quotes. If the text consists of only alphabetic characters, it is taken
as a keyword. (Sometimes keywords are also called reserved words.) A keyword
usually also matches the pattern of an identifier. It is thus logical to exclude
keywords from the identifiers. This can be tested in the input by replacing the
a with if. For literals the rule is that the characters must appear
directly after each other, without spaces between them. This can be tested in
the input by inserting a space between the two colons.
Non-terminals
Below, a grammar consisting of three rules is shown that parses a very simple
expression. In this grammar the text expr is a non-terminal.
(Actually, the text root is also a non-terminal: a special one that
denotes the start of the grammar.) The non-terminal expr is used in
the first rule to represent the expression that is followed by an equal sign.
The last two rules specify that the expression may consist of either an
integer followed by a plus sign and another integer, or a single integer.
In our grammar definition, the order of the rules does matter. The first
grammar rule of a non-terminal that parses a text takes precedence over
following rules. If one swaps the last two rules in the example below, the
input will no longer be parsed.
Note that in the output the plus sign is not shown.
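Following the description above, the three rules would read something like this (a reconstruction, not necessarily the exact grammar shown in the widget):

```
root : expr "=" .
expr : int "+" int .
expr : int .
```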
Add and subtract
In the grammar below, a grammar rule with a minus sign is included for
subtraction. Replace the plus sign in the input with a minus sign to
see the effect in the output. Note that the resulting output is the same.
See the next section for how to distinguish between them.
Syntax tree labeling
To distinguish between these two rules in the abstract syntax tree, we
have added the words add and sub between square brackets at
the end of the rules. Replace the plus sign in the input with a minus sign to
see the effect in the output.
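With the labels added, the rules from the description would read roughly:

```
root : expr "=" .
expr : int "+" int [add] .
expr : int "-" int [sub] .
```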
Left and right recursion
The above grammar does not allow us to parse additions of more than two
numbers. In the grammar below, the grammar rules have been slightly modified.
Note that in the second and third rule the first terminal int has been
replaced by the non-terminal expr. Rules where the non-terminal
appears as the first element are called left recursive. If we had replaced the
second terminal int
in both grammar rules with expr instead, we would have gotten right
recursion. Try this yourself by modifying the grammar.
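As a sketch of the difference (reconstructed from the description):

```
Left recursive:   expr : expr "+" int [add] .
Right recursive:  expr : int "+" expr [add] .
```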
Bar symbol
As a shorthand it is possible to use the bar symbol ('|') to combine the
various rules for one non-terminal into a single rule. This is shown in the
grammar below, which is the grammar that was used at the start for arithmetic
expressions.
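For example, the labeled addition and subtraction rules of the previous sections could be combined as follows (an illustration of the notation only; the full arithmetic grammar also covers multiplication, division, and brackets):

```
expr : expr "+" int [add] | expr "-" int [sub] | int .
```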
Extensions: SEQ, OPT, CHAIN, and LIST
To avoid having to introduce many non-terminals for common language constructs
such as sequences and optional elements, a number of extensions are introduced.
First of all, round brackets can be used to make subrules. The keyword
SEQ can be used to express that the element (or subrule) before it can
occur one or more times. The keyword OPT can be used to indicate that
the element is optional. The keyword CHAIN can be used to define a
chain of one or more elements separated by the literal that follows the
keyword. The keyword LIST is equivalent to a chain of elements separated by
commas. The OPT keyword can be combined with the other keywords.
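As hypothetical illustrations of the notation (these exact rules do not appear on this page):

```
statements : statement SEQ OPT .              (zero or more statements)
arguments  : "(" ( expr CHAIN "," ) OPT ")" . (optional comma-separated chain)
elements   : "[" expr LIST "]" .              (expr LIST equals expr CHAIN ",")
```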
'Natural' programming languages. Just like natural languages have things
in common and follow certain universal patterns, the same is true for
programming languages. So, let's
focus on those common patterns. For example, if a programming language has
reserved words, these are usually reserved everywhere and not restricted to
certain contexts.
Just interpret the grammar. Let's not process the grammar, but see if we
can write an interpreter that tries to parse the input following the grammar
rules. That means a back-tracking recursive descent parser. Yes, bottom-up
parsers, such as the LR
parsers invented by Donald Knuth, can parse input in linear time, but they
require a costly compilation step that often generates large tables. It is a
common experience that writing a grammar for yacc (or bison) is a lot of work,
especially when having to deal with
shift-reduce conflicts.
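The core of such a back-tracking recursive descent interpreter fits in a few lines. This Python sketch is illustrative only (the grammar representation and names are mine, not IParse's); note that the expr rule is deliberately right recursive, since naive recursive descent loops forever on left recursion:

```python
import re

# Each non-terminal maps to a list of alternatives; an alternative is a
# list of symbols: a non-terminal, the terminal 'int', or a literal.
GRAMMAR = {
    'root': [['expr']],
    'expr': [['int', '+', 'expr'],   # right recursive on purpose
             ['int']],
}

def parse(nonterm, text, pos):
    """Try the alternatives in order; return the new position or None."""
    for alternative in GRAMMAR[nonterm]:
        p = pos
        for symbol in alternative:
            p = match(symbol, text, p)
            if p is None:
                break            # back-track: try the next alternative
        else:
            return p
    return None

def match(symbol, text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1                 # skip white space before each symbol
    if symbol in GRAMMAR:
        return parse(symbol, text, pos)
    if symbol == 'int':
        m = re.match(r'\d+', text[pos:])
        return pos + m.end() if m else None
    if text.startswith(symbol, pos):
        return pos + len(symbol)
    return None

def accepts(text):
    p = parse('root', text, 0)
    return p is not None and text[p:].strip() == ''

print(accepts('1 + 2 + 3'))  # True
print(accepts('1 +'))        # False
```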
Memory is not an issue. Nowadays it is no problem to load a source
file into memory. Let's just use caching of intermediate results: remember at
every location in the file which non-terminals parse there and which do not.
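The caching idea can be sketched by memoizing the parse function on (non-terminal, position), packrat-style, so every combination is tried at most once. Again an illustrative Python sketch, not IParse's code:

```python
from functools import lru_cache

GRAMMAR = {
    'expr': (('int', '+', 'expr'), ('int',)),
}
TEXT = '1+2+3'

@lru_cache(maxsize=None)          # cache keyed on (non-terminal, position)
def parse(nonterm, pos):
    for alternative in GRAMMAR[nonterm]:
        p = pos
        for symbol in alternative:
            if symbol in GRAMMAR:
                p = parse(symbol, p)
            elif symbol == 'int':
                p = p + 1 if p < len(TEXT) and TEXT[p].isdigit() else None
            else:
                p = p + 1 if TEXT.startswith(symbol, p) else None
            if p is None:
                break             # this alternative failed; try the next
        else:
            return p
    return None

print(parse('expr', 0))  # 5: the whole of '1+2+3' is parsed
```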
Compiled is not always faster. I made an attempt to modify the interpreter
into a compiler that would generate C code (by simply applying partial
evaluation). Compiling this code took a long time, and the result also
executed more slowly. Why? My guess is that it caused more CPU cache misses.
Hard-coded scanner. Let's just start with a basic set of common terminals
used in programming languages and use a hard-coded scanner.
Scanner called by parser. It makes more sense to let the grammar
dictate which terminals are to be accepted at a given location than to use a
first pass that breaks the input down into terminals, usually called
tokenizing. Experience has shown that this is often not possible without
introducing states into this first pass, states that are a reflection of the
grammar. If we had a terminal for Roman numerals, it would be possible to
write a grammar for which it is very
hard to write a tokenizer that knows when an identifier should be read as a
Roman numeral and when simply as an identifier. Below, an example is given,
where
roman_numeral is a terminal accepting a Roman numeral.
IParse Studio is a good tool for prototyping a grammar, as it allows
you to quickly test certain grammar ideas. IParse tells you what it was
expecting at the first location where an error occurs. It has no error
recovery.
IParse is also suitable for implementing an embedded scripting language with
an interpreter where only short scripts are processed and error recovery
is not important.
Using a generic syntax tree is a disadvantage, because you need to write
code to process it. It is also not easy to attach specific data (think symbol
tables) to certain nodes of a generic tree. For more advanced analyses you
would probably need to transform it into a more specific data structure.
Production grade compilers often use hand-coded parsers with smart
error recovery.
An intermediate solution is to use an LALR parser generator
such as Yacc/Bison in combination with a scanner generator such as Lex/Flex.
It is nice to already have a tested grammar before starting to use a parser
generator. Yacc and Bison do not support some of the extensions that IParse
supports, which will require you to rewrite the grammar rules into multiple
rules.
Single file C program. This was the version it all started with: the
realization that, with some caching, you can implement a reasonably fast
parser without having to do many complicated things. This version also
includes an attempt to deal with context scoping rules and identifier
matching. However, it was a bit too simplistic.
Java implementation. A quick reimplementation in Java, probably not
how you should do it in Java. It was not developed further.
C++ implementation. This implementation in C++ grew through the years.
It contains various implementations of the parser, several kinds of scanners,
support for code pages and Unicode, some more grammar extensions, the
introduction of 'white-space terminals', and the implementation of an unparser.
It is used in production as the parser for the
BiZZdesign scripting language. It was also used for implementing the parsing
and unparsing of Windows resource (.rc) files to aid language
translations.
RawParser. Consists of a single C file, which shows by example how a
grammar-driven, scannerless
parser can be implemented. It was also intended to be educational, with
extensive comments and an incremental exposition with examples and unit tests.
However, I found the requirements that C puts on the order of statements a bit
restrictive.
Literate programming with MarkDown. I came up with the idea for a form
of literate programming using MarkDown files. Traditional literate programming
works by introducing identifiers for fragments of code and uses substitution to
put all the fragments in the right order. Instead, this program parses the
fragments and automatically puts them in the right
order. It also introduces some syntax to extend earlier defined structures and
functions. I started using it for the documentation of RawParser.
JavaScript implementation. Last year, for a workshop similar to this one, I
made a very basic JavaScript implementation embedded in an HTML file. For this
workshop I placed it in a single JavaScript file.
Resources
AGL Editor Demo:
A similar approach, with a more advanced editor (with syntax colouring)