The first step of a compiler is lexing (aka scanning or tokenizing). The goal is to break up an input stream into tokens. For example, given the input stream containing the characters
    if (b == 0) a = "hi"
a natural tokenization would be:
|if|(|b|==|0|)|a|=|"hi"|
where the vertical lines indicate token boundaries. Tokens typically fall into certain recognizable classes, such as identifiers (x, item_count, f9), keywords (if, return), literal constants (23, -1.23e-5, "Hello World!\n"), and operators (+, %, --, >>=). Most lexers discard whitespace and comments, since these are not needed by later stages of the compiler; an exception would be a parser for a system like Javadoc that generates documentation from the contents of appropriately placed comments.
If we were going to write a Java type signature for a lexer, it might be a function that accepts a Reader and returns an Iterator<Token>, though the signature should also have some provision for reporting lexical errors when the input stream cannot be broken up into legal tokens.
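As a rough sketch (the names Token and LexicalError are invented here for illustration), such a signature might look like the following in Java:

    import java.io.Reader;
    import java.util.Iterator;

    // Placeholder token type; a fuller representation is sketched below.
    interface Token {}

    // Hypothetical unchecked exception for reporting lexical errors; since the
    // iterator produces tokens lazily, errors would surface from calls to next().
    class LexicalError extends RuntimeException {
        LexicalError(String message) { super(message); }
    }

    // One possible shape for the lexer's type signature.
    interface Lexer {
        Iterator<Token> lex(Reader input);
    }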
Many tokens contain useful information. For example, an integer literal contains the actual integer. These are called token attributes. In a language like Java, it would be natural to represent tokens as objects whose token attributes are arguments to a constructor. Thus, we might represent an identifier a as an object new Identifier("a"), a floating-point literal 10 as new Number(10.0), or a string literal "hello\n" as new StringLit("hello\n"). Notice that in the case of the string literal, the input for the token contains 7 characters (not counting the surrounding quotation marks), whereas the token attribute has already been interpreted as the corresponding 6-character string.
In addition to its information content, we usually want each token to carry additional information: its position in the input. This position is often described by the name of the input file, the line in that file, and the column in that file (or a range of columns). Later compiler stages use this position information when the compiler needs to report the location of an error.
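A hedged sketch of what such token objects might look like in Java follows; the class and field names here are illustrative rather than taken from any particular compiler, and each token carries a position along with its attribute:

    // Illustrative token classes; names and fields are hypothetical.
    class Position {
        final String file;
        final int line, column;
        Position(String file, int line, int column) {
            this.file = file; this.line = line; this.column = column;
        }
        public String toString() { return file + ":" + line + ":" + column; }
    }

    abstract class Token {
        final Position pos;                 // where the token appeared in the input
        Token(Position pos) { this.pos = pos; }
    }

    class Identifier extends Token {
        final String name;                  // attribute: the identifier's spelling
        Identifier(String name, Position pos) { super(pos); this.name = name; }
    }

    class Number extends Token {
        final double value;                 // attribute: the interpreted numeric value
        Number(double value, Position pos) { super(pos); this.value = value; }
    }

    class StringLit extends Token {
        final String value;                 // attribute: escapes already interpreted
        StringLit(String value, Position pos) { super(pos); this.value = value; }
    }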
It is good to have a principled way to build lexers. Often programmers are too impatient and end up writing lexing code by hand. This approach turns out to be surprisingly error-prone and leads to a large fraction of security vulnerabilities in real systems, particularly when programmers write their lexers in an unsafe language like C and try to make them as fast as possible.
One challenge of lexing is that tokens can overlap: the lexer can't tell, for example, if it is seeing a keyword or a variable identifier until it has read multiple characters. A second challenge is that the lexer typically needs to look ahead at characters beyond the current token in order to decide where the token ends. Unfortunately, the standard I/O interfaces typically don't support this lookahead! For example, we might think that we could write Java code to scan an identifier like the following, assuming an appropriate method isIdChar():
    Reader input;
    ...
    while (true) {
        int c = input.read();
        if (!isIdChar(c)) break;
        id.append((char) c);
    }
Do you see the problem with this code? It reads, and discards, the character just after the identifier. This might not be a problem if the character is whitespace, but in general the discarded character will be part of the next token in the input stream.
The solution to this problem is to use a richer abstraction for reading input, one that allows client code to peek at the lookahead character. Then the code to scan an identifier can be written correctly and simply along the following lines:
    Reader input;
    ...
    while (isIdChar(input.peek())) {
        id.append((char) input.read());
    }
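Java's standard Reader does not itself provide peek(), so the code above assumes a richer reader type. Here is a minimal sketch of such a wrapper (an illustration only; the standard java.io.PushbackReader offers similar functionality through its unread() method):

    import java.io.IOException;
    import java.io.Reader;

    // A reader supporting one character of lookahead. Illustrative sketch only.
    class PeekableReader {
        private final Reader in;
        private int lookahead = -2;   // -2 means "no character buffered"

        PeekableReader(Reader in) { this.in = in; }

        // Returns the next character without consuming it (-1 at end of input).
        int peek() throws IOException {
            if (lookahead == -2) lookahead = in.read();
            return lookahead;
        }

        // Consumes and returns the next character (-1 at end of input).
        int read() throws IOException {
            int c = peek();
            lookahead = -2;
            return c;
        }
    }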
Lookahead doesn't solve the problem with overlapping tokens, though, unless the input source also supports backtracking. For an example of an input source that supports both lookahead and arbitrary backtracking, see the Scanner class in the EasyIO package [source].
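As a rough illustration of the same idea using only the standard library, BufferedReader's mark() and reset() methods provide bounded lookahead with backtracking. This sketch is not the EasyIO Scanner, and its token-classification logic is deliberately simplistic:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    // Lookahead plus backtracking via mark()/reset(). Illustrative sketch only.
    public class BacktrackDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new StringReader("iffy = 2"));
            char[] buf = new char[2];
            in.read(buf);                 // read "if": keyword, or prefix of an identifier?
            in.mark(1);                   // remember the current position
            int next = in.read();         // look ahead one character
            in.reset();                   // back up; that character belongs to later tokens
            boolean keyword = next == -1 || !Character.isLetterOrDigit(next);
            System.out.println(keyword ? "keyword if" : "start of an identifier");
        }
    }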
A more declarative approach to lexing is to define the legal tokens using regular expressions and then to automatically synthesize the lexer from these regular expressions. This approach is easy and is more likely to result in efficient, maintainable, secure code.
Tokens in programming languages can be quite complex, so regular expressions need to be expressive. For example, C and Java support floating-point literals like 2.0, .1e0, and 20.e99. How do we compactly represent and efficiently recognize the legal floating-point literals?
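One plausible answer, written here with java.util.regex rather than a lexer generator, is sketched below; the pattern is an approximation for illustration, since the exact set of legal literals (suffixes, hexadecimal floats, and so on) differs between C and Java:

    import java.util.regex.Pattern;

    // An approximate regular expression for floating-point literals.
    public class FloatLiteral {
        static final Pattern FLOAT = Pattern.compile(
            "(\\d+\\.\\d*|\\.\\d+)([eE][+-]?\\d+)?"   // 2.0, .1e0, 20.e99
            + "|\\d+[eE][+-]?\\d+");                  // 1e10: exponent but no point

        public static void main(String[] args) {
            for (String s : new String[] { "2.0", ".1e0", "20.e99", "20", "e9" }) {
                System.out.println(s + " -> " + FLOAT.matcher(s).matches());
            }
        }
    }

The first three test strings match; the last two do not, since an integer without a decimal point or exponent is not a floating-point literal.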
A regular expression R denotes a set of strings called the language of R, written L(R). Using the regular-expression syntax that is commonly used for lexer generators, we can define the language recognized by various common regular-expression constructs:
Syntax | Language |
---|---|
a (ordinary character) | {“a”} |
ε | {“”} (the set containing just the empty string) |
R|S | L(R) ∪ L(S) |
RS | {xy | x ∈ L(R) ∧ y ∈ L(S)} |
R* | L(ε) ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ ... |
Assuming a and b are ordinary characters from the “alphabet” we are using to make tokens, here are some examples of regular expressions and the corresponding language:
aε | {"a"} |
a|b | {"a", "b"} |
(ab)* | {"", "ab", "abab", "ababab", ...} |
(a|ε)b | {"ab", "b"} |
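These languages can also be checked mechanically. The following snippet uses java.util.regex, whose syntax is close to (but not identical to) the notation above; ε is written as an empty alternative:

    import java.util.regex.Pattern;

    // Membership tests for the example languages above.
    public class RegexLanguages {
        public static void main(String[] args) {
            System.out.println(Pattern.matches("(ab)*", "ababab")); // true
            System.out.println(Pattern.matches("(ab)*", ""));       // true
            System.out.println(Pattern.matches("(a|)b", "b"));      // true: (a|ε)b
            System.out.println(Pattern.matches("(a|)b", "ab"));     // true
            System.out.println(Pattern.matches("a|b", "ab"));       // false
        }
    }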
Using these basic regular-expression constructs, we can define convenient notations for describing the tokens in programming languages, as syntactic sugar:
Syntax | Desugaring |
---|---|
R? | R|ε |
R+ | R(R*) or R*R |
[abc] | (a|b|c) (this is known as a character class) |
[^ab] | Any character except a or b |
[a-zA-Z] | Any alphabetic character |
[^a-zA-Z] | Any non-alphabetic character |
\x0A, \n | ASCII 10: newline character |
. | Anything but a newline character: roughly [^\n], ignoring Unicode newline characters |
[^] | Any character |
A lexer generator like JFlex, lex, or ocamllex takes as input a specification consisting of a set of regular expressions specifying the various legal tokens. It outputs code implementing a lexer that breaks its input into tokens. In the lexer generator spec, each regular expression is paired with an action that the lexer performs when input is matched against that expression. For example, the action might be to return an object representing the matched token, or simply to do nothing, ignoring the matched input.
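To make the "regular expression plus action" structure concrete, here is a tiny hand-written sketch in Java using java.util.regex. The rule set and token names are made up, and a real generator compiles the rules into a single finite automaton rather than trying patterns one at a time, but the shape of the specification is similar:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hand-written illustration of a "regular expression + action" rule table.
    public class TinyLexer {
        record Rule(Pattern pattern, String kind) {}   // kind == null means "ignore"

        static final Rule[] RULES = {
            new Rule(Pattern.compile("\\s+"), null),             // discard whitespace
            new Rule(Pattern.compile("if|return"), "KEYWORD"),
            new Rule(Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"), "IDENT"),
            new Rule(Pattern.compile("[0-9]+"), "INT"),
            new Rule(Pattern.compile("==|=|\\(|\\)"), "OPERATOR"),
        };

        public static void main(String[] args) {
            String input = "if (b == 0) a = 42";
            int pos = 0;
            while (pos < input.length()) {
                boolean matched = false;
                for (Rule r : RULES) {
                    Matcher m = r.pattern().matcher(input.substring(pos));
                    if (m.lookingAt()) {                         // match at current position
                        if (r.kind() != null)                    // the rule's action
                            System.out.println(r.kind() + " \"" + m.group() + "\"");
                        pos += m.end();
                        matched = true;
                        break;
                    }
                }
                if (!matched)
                    throw new IllegalArgumentException("lexical error at position " + pos);
            }
        }
    }

Note that this sketch takes the first rule that matches at each position, whereas generated lexers take the longest match (breaking ties in favor of the earlier rule), so that an identifier like iffy is scanned as one token rather than as the keyword if followed by fy.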
To aid in writing lexical specifications, a lexer generator like JFlex also supports defining abbreviations, which can be very helpful for building up complex regular expressions. Once defined, an abbreviation can be used in other regular expressions, including in other abbreviations, as in the following:
    digit = [0-9]
    posint = [1-9]{digit}
Keep in mind that abbreviations cannot be recursive; a definition like x = {x}a|ε is illegal because it cannot be fully expanded (not to mention that it doesn't add anything beyond the definition x = a*). Recursive abbreviations would give us the full power of context-free grammars, which is overkill for specifying lexical analysis.
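The same layering can be imitated when writing regular expressions by hand, for instance by composing named string fragments in Java (a rough analogue only, not JFlex syntax):

    import java.util.regex.Pattern;

    // Building regular expressions out of named, non-recursive fragments.
    public class Abbreviations {
        static final String DIGIT  = "[0-9]";
        static final String POSINT = "[1-9]" + DIGIT;   // analogue of [1-9]{digit}

        public static void main(String[] args) {
            System.out.println(Pattern.matches(POSINT, "42"));  // true
            System.out.println(Pattern.matches(POSINT, "07"));  // false
        }
    }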
The following JFlex example is a good starting point for writing a JFlex spec:
MyLexer.jflex
For certain token classes, such as strings and comments, it is handy to make use of lexer states in which the recognized tokens differ. We can modify the previous JFlex example to support comments by introducing a COMMENT state to simplify scanning comments. The initial state of the lexer is YYINITIAL; the function yybegin() can be used to change the state of the lexer in an action. Regular expressions associated with a particular lexer state are only active when the lexer is in that state.
MyLexer2.jflex
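To see what a COMMENT state accomplishes without the JFlex machinery, here is a hand-written Java sketch that uses an explicit state variable; the comment syntax (/* ... */) is assumed for illustration, and the generated code from MyLexer2.jflex will of course look different:

    // An explicit state variable determines which rules are currently active,
    // mirroring the role of YYINITIAL/COMMENT and yybegin() in a JFlex spec.
    public class CommentStateDemo {
        enum State { INITIAL, COMMENT }

        public static void main(String[] args) {
            String input = "a = 1 /* set a */ b = 2";
            State state = State.INITIAL;
            StringBuilder kept = new StringBuilder();
            int i = 0;
            while (i < input.length()) {
                if (state == State.INITIAL) {
                    if (input.startsWith("/*", i)) {
                        state = State.COMMENT;           // like yybegin(COMMENT)
                        i += 2;
                    } else {
                        kept.append(input.charAt(i++));  // normal rules apply here
                    }
                } else {                                 // COMMENT: only comment rules active
                    if (input.startsWith("*/", i)) {
                        state = State.INITIAL;           // like yybegin(YYINITIAL)
                        i += 2;
                    } else {
                        i++;                             // discard comment text
                    }
                }
            }
            System.out.println(kept);                    // prints: a = 1  b = 2
        }
    }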