Context-Free Grammars

We have now completed lexical analysis. Our input program has been converted into a stream of tokens. The next step is syntactic analysis, or parsing, in which the compiler will read the token stream and recognize whether the tokens represent valid language syntax. Typically, the parser will also construct a tree representation of the program, for use by later compiler phases.

The parser does not in general complete the task of recognizing whether the program is legal; for example, a compiler will often check that the types of expressions agree and that variables are declared. These static checks are usually the job of the next compiler phase, semantic analysis.

Parsing is a well-established part of computer science, but it is worth understanding how to do it well because many programs, not just compilers, need to parse information.

Context-free grammars

We take the approach in this course of building implementations from precise specifications. How might we specify the legal syntax of programs in a programming language? We saw that for the problem of specifying legal tokens, regular expressions give us a sufficiently expressive specification language. Unfortunately, regular expressions are not expressive enough for most programming languages.

An easy way to see this is by considering the need to check whether parentheses and other grouping constructs are properly balanced. To check that parentheses are balanced, it is necessary to count the number of open parentheses that have not yet been closed. However, regular expressions can be recognized by finite automata, and counting an unbounded number of parentheses would require an unbounded number of states. No finite automaton can keep track of the number of unbalanced parentheses, so regular expressions are not powerful enough to specify language syntax. (The classic example is the language of strings of n open parentheses followed by n close parentheses, which is context-free but not regular.)

This limitation of regular expressions motivates the use of context-free grammars (CFGs) to specify language syntax.

A context-free grammar is characterized by a set of productions. Each production explains how to rewrite a nonterminal symbol into some sequence of terminal symbols (tokens) or nonterminal symbols.

For example, we can define a language of arithmetic expressions, including balanced parentheses, using a CFG with nonterminals S and E:

S → S + S
S → S * S
S → E
E → n
E → ( S )

As an abbreviation, we typically borrow regexp notation and write all the right-hand sides for a given nonterminal together:

S → S + S | S * S | E
E → n | ( S )

The nonterminal S is the start symbol for the grammar. The symbols +, *, n, (, and ) are terminals (tokens). (Here, the token n represents some literal number.)

Derivations

Given a grammar G, a string is in its language L(G) if there exists some derivation of the string using G. A derivation is a sequence of rewrites that starts with the start symbol and eventually rewrites it into the string. Each step in the sequence applies some production of the grammar: it does a rewrite αAβ → αγβ where A→γ is a production. The grammar is called “context-free” because whether the production applies doesn't depend on the surrounding context α and β.

For example, the string “(1+4)+2” is in the language of the grammar. We can see this by constructing a derivation:

S → S + S → E + S → ( S ) + S
  → ( S + S ) + S → ( E + S ) + S
  → ( 1 + S ) + S → ( 1 + E ) + S
  → ( 1 + 4 ) + S → ( 1 + 4 ) + E
  → ( 1 + 4 ) + 2

This derivation is a leftmost derivation: at each step, the nonterminal that is expanded is the leftmost nonterminal in the string. As we will see soon, leftmost derivations are constructed by top-down parsers. It is also possible to construct a rightmost derivation for any string in the language, expanding the rightmost nonterminal at each step. For example, the rightmost derivation of the same string proceeds as follows:

S → S + S → S + E → S + 2
  → E + 2 → ( S ) + 2
  → ( S + S ) + 2 → ( S + E ) + 2
  → ( S + 4 ) + 2 → ( E + 4 ) + 2
  → ( 1 + 4 ) + 2

Parse trees and ambiguity

A derivation corresponds to a parse tree, in which the root is the start symbol and the leaves, read from left to right, form the final string. Both of the derivations above correspond to the following parse tree:

          S
        / | \
       S  +  S
       |     |
       E     E
     / | \   |
    (  S  )  2
     / | \
    S  +  S
    |     |
    E     E
    |     |
    1     4

Although these two derivations both generate the same parse tree, in general a string can have more than one parse tree. When that happens, the grammar is said to be ambiguous: an ambiguous grammar is one for which some string has multiple parse trees.

Our example grammar is in fact ambiguous. For example, consider the string “2 + 3 * 4”. It has one parse tree in which the production S→S+S is used at the root, and another in which the production S→S*S is used at the root. Considered as arithmetic expressions, the first parse tree corresponds to the expected value of the expression, 14. The second parse tree corresponds to computing 20: arguably, the wrong answer. The problem with the grammar is that it does not enforce the precedence or associativity of the operators + and *.
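
Sketched in text form, the first parse tree, with S→S+S at the root, evaluates to 14:

       S
     / | \
    S  +  S
    |   / | \
    E  S  *  S
    |  |     |
    2  E     E
       |     |
       3     4

The second, with S→S*S at the root, evaluates to 20:

          S
        / | \
       S  *  S
     / | \   |
    S  +  S  E
    |     |  |
    E     E  4
    |     |
    2     3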

Enforcing precedence and associativity

Enforcing precedence and associativity is a common problem when designing a grammar. Fortunately, there are some standard ways to do it.

The problem with the original grammar is that it allows the production S→S+S to appear in the parse tree directly under the production S→S*S. We can prevent such parse trees by introducing separate nonterminals to represent different precedence levels. For example, we can say that an arithmetic expression is a sum of one or more terms (T), where each term is a product of one or more factors (F):

S → S + T | T
T → T * F | F
F → n | ( S )

This grammar prevents a sum from appearing under a product, unless it is surrounded by parentheses.
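
For example, the string “2 + 3 * 4” now has exactly one parse tree, in which the product appears below the sum:

       S
     / | \
    S  +  T
    |   / | \
    T  T  *  F
    |  |     |
    F  F     4
    |  |
    2  3

Evaluating this tree bottom-up yields 2 + (3 * 4) = 14, as expected.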

This grammar also enforces left-associativity of the operators + and *. This restriction arises because the production S→S+T is left-recursive: the nonterminal S appears on the right-hand side only as the first (leftmost) symbol. Conversely, if we want a binary operator to be right-associative, we would use a right-recursive production like S→T+S.
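
For example, under this grammar the string “1 + 2 + 3” has only the parse tree below, which groups the operations as (1 + 2) + 3:

          S
        / | \
       S  +  T
     / | \   |
    S  +  T  F
    |     |  |
    T     F  3
    |     |
    F     2
    |
    1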

Some parser generators, such as CUP, yacc, and bison, allow precedence and associativity ambiguities to be resolved through precedence declarations. For example, in CUP, the following declarations make PLUS and TIMES left-associative and make TIMES higher-precedence than PLUS (declarations appearing later have higher precedence):

precedence left PLUS;
precedence left TIMES;

Precedence declarations prevent parses that violate the precedence and associativity constraints; we will see later in more detail how this works.

The dangling-else problem

Most programming languages have an if-then-else construct, which naturally leads to ambiguity. Consider the following simplified grammar for statements:

S → if E then S
  | if E then S else S
  | other

If we use this grammar to parse a string of the form if E1 then if E2 then S1 else S2, there are two possible parses: one in which the “else” clause attaches to the first “if” and one in which it attaches to the second. Most languages implement the closest-if rule: an else attaches to the closest preceding if.

To solve this problem, we observe that there are two kinds of statements: unmatched statements, which end with a clause of the form then S that has no corresponding else, and matched statements, such as if E then S1 else S2, in which every then has a corresponding else. The insight is that we cannot allow an unmatched statement to appear immediately before an else, because then the closest-if rule would require the else clause to attach to an if inside the unmatched statement.

This intuition can be captured by introducing separate nonterminals to represent matched (M) and unmatched (U) statements, where a statement S is either kind:

S → M | U
M → if E then M else M
  | other
U → if E then S
  | if E then M else U

The key is that both productions that include else require the statement preceding the else to be a matched statement.
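
To see the rule in action, the string if E then if E then other else other now has only one derivation (up to the order in which nonterminals are expanded):

S → U → if E then S → if E then M
  → if E then if E then M else M
  → if E then if E then other else M
  → if E then if E then other else other

The else necessarily attaches to the inner if. Attaching it to the outer if would require deriving if E then M else …, with M expanding to if E then other; but that statement is unmatched, so it cannot be derived from M.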

In CUP and similar parser generators, it turns out that the dangling-else problem can also be solved through precedence declarations:

precedence nonassoc IF;
precedence nonassoc ELSE;

Other parsing issues

Some programming languages pose extra challenges for parsing. For example, significant whitespace has (after a multi-decade hiatus) again become popular in languages like Python, Haskell, and JavaScript. Indentation levels in Python and Haskell implicitly signal the grouping of expressions. The usual strategy for parsing languages with this feature is for the lexer to insert additional tokens, known as indent/dedent tokens, at points where the indentation level changes. These tokens can then be used in the language grammar as if they were invisible parentheses or braces.
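
For example, a lexer for a Python-like language might translate the fragment

if x:
    y = 1
z = 2

into a token stream along the lines of the following (the token names here are schematic, not those of any particular lexer):

IF ID COLON NEWLINE INDENT ID GETS NUM NEWLINE DEDENT ID GETS NUM NEWLINE

The INDENT/DEDENT pair brackets the body of the if exactly as a pair of braces would.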

This kind of strategy is not needed for the Xi language, because although it makes semicolons optional, it is not whitespace-sensitive.

Another challenge is languages that distinguish different categories of identifiers. The lexer should generate different token types for different kinds of identifiers. In some languages, like OCaml, this distinction is not a problem because the identifier category can be determined purely lexically. However, the languages C and C++ are harder to handle. For example, the correct parse for the statement

HashTable<K,V> x;

depends on whether HashTable is the name of a class. If it is instead the name of an integer variable, then the statement actually describes two comparisons among four variables!
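
The two readings, with the grouping made explicit:

(HashTable<K,V>) x;          a declaration: x is a variable of type HashTable<K,V>
(HashTable < K), (V > x);    an expression: two comparisons joined by the comma operator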

To generate different kinds of tokens for different kinds of identifiers, information from later stages of the compiler is needed during lexical analysis. This lexer feedback makes the compiler more complex and fragile, since compilation is no longer a simple linear sequence of transformations.

Another technique for dealing with ambiguous parses is to defer the problem to later compiler phases, when more information is available, such as the entire abstract syntax tree for the program. This approach works well only when the ambiguity does not affect the overall structure of the parse tree; as the example shows, that is not the case in C++.

Recursive-Descent Parsing

Recursive-descent parsers are an example of top-down parsers, which decide how to build the parse tree from the top down as they read the input. In a recursive-descent parser, the parser is structured as a set of potentially mutually recursive functions; however, top-down parsers can also be structured as table-driven code that uses an explicit stack instead of recursion. Top-down parsers are also known as LL parsers, where the first L stands for “left-to-right” and the second L stands for “leftmost”: an LL parser reads its input left-to-right and constructs a leftmost derivation as it does so.

LL parsers cannot handle all grammars, or even all unambiguous grammars. An example of a grammar they can handle is the following grammar for s-expressions, which were invented by John McCarthy for the programming language Lisp, a precursor of modern functional languages.

S → x | ( L )
L → ε | S L

An example of an s-expression is the expression (let ((x 2)) (+ x 1)), which in the programming language Scheme (a Lisp descendant) is roughly equivalent to the OCaml expression let x = 2 in x + 1.

Because of the nice properties of this grammar, we can directly write a recursive function to recognize an input stream containing an expansion of the nonterminal S. Here is the pseudo-code:

void parseS() throws SyntaxError {
    switch (peek()) {
        case ID: // S → x
            consume(ID);
            return;
        case LPAREN: // S → ( L )
            consume(LPAREN);
            parseL();
            consume(RPAREN);
            return;
    }
    throw syntaxError(peek());
}
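
The helpers peek, consume, and syntaxError are left unspecified above. Here is a minimal sketch of what they might look like, assuming the lexer delivers an array of tokens; the Token class, the token-type constants, and SyntaxError are our own inventions here, not part of any particular library:

class Parser {
    // Token-type constants; a lexer generator would normally supply these.
    static final int ID = 0, LPAREN = 1, RPAREN = 2, EOF = 3;

    private final Token[] tokens; // output of lexical analysis, ending with an EOF token
    private int pos = 0;          // index of the next unconsumed token

    Parser(Token[] tokens) { this.tokens = tokens; }

    // Return the type of the next token without consuming it.
    int peek() { return tokens[pos].type; }

    // Consume and return the next token, which must have the expected type.
    Token consume(int expected) throws SyntaxError {
        if (peek() != expected) throw syntaxError(peek());
        return tokens[pos++];
    }

    // Build (but do not throw) an error describing an unexpected token.
    SyntaxError syntaxError(int tokenType) {
        return new SyntaxError("unexpected token of type " + tokenType);
    }

    // parseS() and parseL() from the text would be methods of this class.
}

class Token {
    final int type;      // one of the constants above
    final String lexeme; // the text matched by the lexer
    Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
}

class SyntaxError extends Exception {
    SyntaxError(String message) { super(message); }
}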

Given that we are expecting an S on the input, the next token we see immediately determines which S production should be used. Hence, an LL parser is also known as a predictive parser: given an expected nonterminal and the lookahead token, it can predict the production to use.

Similarly, we can implement a recursive function to recognize an L on the input stream:

void parseL() throws SyntaxError {
    switch (peek()) {
        case ID:     // L → S L 
        case LPAREN: // L → S L 
            parseS();
            parseL();
            return;
        case RPAREN: // L → ε
            return;
    }
    throw syntaxError(peek());
}

In a recursive-descent parser, nonterminals in the grammar are implemented as functions, and productions in the grammar become cases within these functions. The parse tree corresponds directly to the call tree as the parser executes.

From recognizer to parser

The s-expression code above is merely a recognizer that tells us whether the input can be interpreted as an s-expression. It doesn't produce any representation of the s-expression it recognizes. However, we can convert it into a parser that produces such a representation with only modest effort.

We modify each of the functions to return a value of the appropriate type. For example, we might have parseS return an Object and parseL return a LinkedList<Object>. Each production handled in the parser builds the appropriate object from the results of parsing its subexpressions:

Object parseS() throws SyntaxError {
    switch (peek()) {
        case ID:                                // S → x
            return consume(ID);
        case LPAREN:                            // S → ( L )
            consume(LPAREN);
            Object result = parseL();
            consume(RPAREN);
            return result;
    }
    throw syntaxError(peek());
}

LinkedList<Object> parseL() throws SyntaxError {
    switch (peek()) {
        case ID:                                // L → S L 
        case LPAREN:                            // L → S L 
            Object first = parseS();
            LinkedList<Object> rest = parseL();
            rest.push(first);
            return rest;
        case RPAREN:                            // L → ε
            return new LinkedList<>();
    }
    throw syntaxError(peek());
}
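
For example, given tokens for the input ( a ( b c ) ), parseS returns a LinkedList whose first element is the token for a and whose second element is a nested LinkedList containing the tokens for b and c; the nesting of the lists mirrors the nesting of the s-expression.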

For this example, the abstract syntax tree has been kept deliberately simple. In a parser for a real compiler, we would probably want to build the tree out of many more node types, corresponding to different language constructs and including additional information such as the position in the input where the syntax was located.

Summary

While recursive descent parsing is appealingly simple, it doesn't work for all grammars, and for more complex grammars, predicting the correct production involves some subtleties. Our next step is to see how to automate the construction of general LL parsers.