The first step of a compiler is lexing (aka scanning or tokenizing). The goal is to break up an input stream into tokens. For example, given the input stream containing the characters
    if (b == 0) a = "hi"
a natural tokenization would be:
|if|(|b|==|0|)|a|=|"hi"|
where the vertical lines indicate token boundaries. Tokens typically fall into certain recognizable classes, such as identifiers (x, item_count, f9), keywords (if, return), literal constants (23, -1.23e-5, "Hello World!\n"), and operators (+, %, --, >>=). Most lexers discard whitespace and comments, since these are not needed by later stages of the compiler; an exception would be a parser for a system like Javadoc that generates documentation from the contents of appropriately placed comments.
If we were going to write a Java type signature for a lexer, it might be a function that accepts a Reader and returns an Iterator<Token>, though the signature should also have some provision for reporting lexical errors when the input stream cannot be broken up into legal tokens.
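As a rough sketch (the names Token and LexicalError are invented here for illustration), such a signature might look like the following in Java:

    import java.io.Reader;
    import java.util.Iterator;

    // Placeholder token type; a fuller representation is sketched below.
    interface Token {}

    // Hypothetical unchecked exception for reporting lexical errors; since the
    // iterator produces tokens lazily, errors would surface from calls to next().
    class LexicalError extends RuntimeException {
        LexicalError(String message) { super(message); }
    }

    // One possible shape for the lexer's type signature.
    interface Lexer {
        Iterator<Token> lex(Reader input);
    }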
Many tokens contain useful information. For example, an integer literal contains the actual integer. These are called token attributes. In a language like Java, it would be natural to represent tokens as objects whose token attributes are arguments to a constructor. Thus, we might represent an identifier a as an object new Identifier("a"), a floating-point literal 10 as new Number(10.0), or a string literal "hello\n" as new StringLit("hello\n"). Notice that in the case of the string literal, the input for the token contains 7 characters (not counting the surrounding quotation marks), whereas the token attribute has already been interpreted as the corresponding 6-character string.
In addition to its information content, we usually want each token to carry additional information: its position in the input. This position is often described by the name of the input file, the line in that file, and the column in that file (or a range of columns). Later compiler stages use this position information when the compiler needs to report the location of an error.
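A hedged sketch of what such token objects might look like in Java follows; the class and field names here are illustrative rather than taken from any particular compiler, and each token carries a position along with its attribute:

    // Illustrative token classes; names and fields are hypothetical.
    class Position {
        final String file;
        final int line, column;
        Position(String file, int line, int column) {
            this.file = file; this.line = line; this.column = column;
        }
        public String toString() { return file + ":" + line + ":" + column; }
    }

    abstract class Token {
        final Position pos;                 // where the token appeared in the input
        Token(Position pos) { this.pos = pos; }
    }

    class Identifier extends Token {
        final String name;                  // attribute: the identifier's spelling
        Identifier(String name, Position pos) { super(pos); this.name = name; }
    }

    class Number extends Token {
        final double value;                 // attribute: the interpreted numeric value
        Number(double value, Position pos) { super(pos); this.value = value; }
    }

    class StringLit extends Token {
        final String value;                 // attribute: escapes already interpreted
        StringLit(String value, Position pos) { super(pos); this.value = value; }
    }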
It is good to have a principled way to build lexers. Often programmers are too impatient and end up writing lexing code by hand. This approach turns out to be surprisingly error-prone and leads to a large fraction of security vulnerabilities in real systems, particularly when programmers write their lexers in an unsafe language like C and try to make them as fast as possible.
One challenge of lexing is that tokens can overlap: the lexer can't tell, for example, if it is seeing a keyword or a variable identifier until it has read multiple characters. A second challenge is that the lexer typically needs to look ahead at characters beyond the current token in order to decide where the token ends. Unfortunately, the standard I/O interfaces typically don't support this lookahead! For example, we might think that we could write Java code to scan an identifier like the following, assuming an appropriate method isIdChar():
    Reader input;
    ...
    while (true) {
        int c = input.read();
        if (!isIdChar(c)) break;
        id.append((char) c);
    }
Do you see the problem with this code? It reads, and discards, the character just after the identifier. This might not be a problem if the character is whitespace, but in general the discarded character will be part of the next token in the input stream.
The solution to this problem is to use a richer abstraction for reading input, one that allows client code to peek at the lookahead character. Then the code to scan an identifier can be written correctly and simply along the following lines:
    Reader input;
    ...
    while (isIdChar(input.peek())) {
        id.append((char) input.read());
    }
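Java's standard Reader does not itself provide peek(), so the code above assumes a richer reader type. Here is a minimal sketch of such a wrapper (an illustration only; the standard java.io.PushbackReader offers similar functionality through its unread() method):

    import java.io.IOException;
    import java.io.Reader;

    // A reader supporting one character of lookahead. Illustrative sketch only.
    class PeekableReader {
        private final Reader in;
        private int lookahead = -2;   // -2 means "no character buffered"

        PeekableReader(Reader in) { this.in = in; }

        // Returns the next character without consuming it (-1 at end of input).
        int peek() throws IOException {
            if (lookahead == -2) lookahead = in.read();
            return lookahead;
        }

        // Consumes and returns the next character (-1 at end of input).
        int read() throws IOException {
            int c = peek();
            lookahead = -2;
            return c;
        }
    }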
Lookahead doesn't solve the problem with overlapping tokens, though, unless the input source also supports backtracking. For an example of an input source that supports both lookahead and arbitrary backtracking, see the Scanner class in the EasyIO package [source].
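As a rough illustration of the same idea using only the standard library, BufferedReader's mark() and reset() methods provide bounded lookahead with backtracking. This sketch is not the EasyIO Scanner, and its token-classification logic is deliberately simplistic:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    // Lookahead plus backtracking via mark()/reset(). Illustrative sketch only.
    public class BacktrackDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new StringReader("iffy = 2"));
            char[] buf = new char[2];
            in.read(buf);                 // read "if": keyword, or prefix of an identifier?
            in.mark(1);                   // remember the current position
            int next = in.read();         // look ahead one character
            in.reset();                   // back up; that character belongs to later tokens
            boolean keyword = next == -1 || !Character.isLetterOrDigit(next);
            System.out.println(keyword ? "keyword if" : "start of an identifier");
        }
    }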
A more declarative approach to lexing is to define the legal tokens using regular expressions and then to automatically synthesize the lexer from these regular expressions. This approach is easy and is more likely to result in efficient, maintainable, secure code.
Tokens in programming languages can be quite complex, so regular expressions need to be expressive. For example, C and Java support floating-point literals like 2.0, .1e0, and 20.e99. How do we compactly represent and efficiently recognize the legal floating-point literals?
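One plausible answer, written here with java.util.regex rather than a lexer generator, is sketched below; the pattern is an approximation for illustration, since the exact set of legal literals (suffixes, hexadecimal floats, and so on) differs between C and Java:

    import java.util.regex.Pattern;

    // An approximate regular expression for floating-point literals.
    public class FloatLiteral {
        static final Pattern FLOAT = Pattern.compile(
            "(\\d+\\.\\d*|\\.\\d+)([eE][+-]?\\d+)?"   // 2.0, .1e0, 20.e99
            + "|\\d+[eE][+-]?\\d+");                  // 1e10: exponent but no point

        public static void main(String[] args) {
            for (String s : new String[] { "2.0", ".1e0", "20.e99", "20", "e9" }) {
                System.out.println(s + " -> " + FLOAT.matcher(s).matches());
            }
        }
    }

The first three test strings match; the last two do not, since an integer without a decimal point or exponent is not a floating-point literal.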
A regular expression R denotes a set of strings called the language of R, written L(R). Using the regular-expression syntax that is commonly used for lexer generators, we can define the language recognized by various common regular-expression constructs:
Syntax | Language |
---|---|
a (ordinary character) | {“a”} |
ε | {“”} (the set containing just the empty string) |
R|S | L(R) ∪ L(S) |
RS | {xy | x ∈ L(R) ∧ y ∈ L(S)} |
R* | L(ε) ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ ... |
Assuming a and b are ordinary characters from the “alphabet” we are using to make tokens, here are some examples of regular expressions and the corresponding language:
aε | {"a"} |
a|b | {"a", "b"} |
(ab)* | {"", "ab", "abab", "ababab", ...} |
(a|ε)b | {"ab", "b"} |
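These languages can also be checked mechanically. The following snippet uses java.util.regex, whose syntax is close to (but not identical to) the notation above; ε is written as an empty alternative:

    import java.util.regex.Pattern;

    // Membership tests for the example languages above.
    public class RegexLanguages {
        public static void main(String[] args) {
            System.out.println(Pattern.matches("(ab)*", "ababab")); // true
            System.out.println(Pattern.matches("(ab)*", ""));       // true
            System.out.println(Pattern.matches("(a|)b", "b"));      // true: (a|ε)b
            System.out.println(Pattern.matches("(a|)b", "ab"));     // true
            System.out.println(Pattern.matches("a|b", "ab"));       // false
        }
    }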
Using these basic regular-expression constructs, we can define convenient notations for describing the tokens in programming languages, as syntactic sugar:
Syntax | Desugaring |
---|---|
R? | R|ε |
R+ | R(R*) or R*R |
[abc] | (a|b|c) (this is known as a character class) |
[^ab] | Any character except a or b |
[a-zA-Z] | Any alphabetic character |
[^a-zA-Z] | Any non-alphabetic character |
\x0A, \n | ASCII 10: newline character |
. | Anything but a newline character: roughly [^\n], ignoring Unicode newline characters |
[^] | Any character |
A lexer generator like JFlex, lex, or ocamllex takes as input a specification consisting of a set of regular expressions specifying the various legal tokens. It outputs code implementing a lexer that breaks its input into tokens. In the lexer generator spec, each regular expression is paired with an action that the lexer performs when input is matched against that expression. For example, the action might be to return an object representing the matched token, or simply to do nothing, ignoring the matched input.
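To make the "regular expression plus action" structure concrete, here is a tiny hand-written sketch in Java using java.util.regex. The rule set and token names are made up, and a real generator compiles the rules into a single finite automaton rather than trying patterns one at a time, but the shape of the specification is similar:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hand-written illustration of a "regular expression + action" rule table.
    public class TinyLexer {
        record Rule(Pattern pattern, String kind) {}   // kind == null means "ignore"

        static final Rule[] RULES = {
            new Rule(Pattern.compile("\\s+"), null),             // discard whitespace
            new Rule(Pattern.compile("if|return"), "KEYWORD"),
            new Rule(Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"), "IDENT"),
            new Rule(Pattern.compile("[0-9]+"), "INT"),
            new Rule(Pattern.compile("==|=|\\(|\\)"), "OPERATOR"),
        };

        public static void main(String[] args) {
            String input = "if (b == 0) a = 42";
            int pos = 0;
            while (pos < input.length()) {
                boolean matched = false;
                for (Rule r : RULES) {
                    Matcher m = r.pattern().matcher(input.substring(pos));
                    if (m.lookingAt()) {                         // match at current position
                        if (r.kind() != null)                    // the rule's action
                            System.out.println(r.kind() + " \"" + m.group() + "\"");
                        pos += m.end();
                        matched = true;
                        break;
                    }
                }
                if (!matched)
                    throw new IllegalArgumentException("lexical error at position " + pos);
            }
        }
    }

Note that this sketch takes the first rule that matches at each position, whereas generated lexers take the longest match (breaking ties in favor of the earlier rule), so that an identifier like iffy is scanned as one token rather than as the keyword if followed by fy.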
To aid in writing lexical specifications, a lexer generator like JFlex also supports defining abbreviations, which can be very helpful for building up complex regular expressions. Once defined, an abbreviation can be used in other regular expressions, including in other abbreviations, as in the following:
    digit = [0-9]
    posint = [1-9]{digit}
Keep in mind that abbreviations cannot be recursive; a definition like x = {x}a|ε is illegal because it cannot be fully expanded (not to mention that it doesn't add anything beyond the definition x = a*). Recursive abbreviations would give us the full power of context-free grammars, which is overkill for specifying lexical analysis.
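The same layering can be imitated when writing regular expressions by hand, for instance by composing named string fragments in Java (a rough analogue only, not JFlex syntax):

    import java.util.regex.Pattern;

    // Building regular expressions out of named, non-recursive fragments.
    public class Abbreviations {
        static final String DIGIT  = "[0-9]";
        static final String POSINT = "[1-9]" + DIGIT;   // analogue of [1-9]{digit}

        public static void main(String[] args) {
            System.out.println(Pattern.matches(POSINT, "42"));  // true
            System.out.println(Pattern.matches(POSINT, "07"));  // false
        }
    }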
The following JFlex example is a good starting point for writing a JFlex spec:
MyLexer.jflex
For certain token classes, such as strings and comments, it is handy to make use of lexer states in which the recognized tokens differ. We can modify the previous JFlex example to support comments by introducing a COMMENT state to simplify scanning comments. The initial state of the lexer is YYINITIAL; the function yybegin() can be used to change the state of the lexer in an action. Regular expressions associated with a particular lexer state are only active when the lexer is in that state.
MyLexer2.jflex
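To see what a COMMENT state accomplishes without the JFlex machinery, here is a hand-written Java sketch that uses an explicit state variable; the comment syntax (/* ... */) is assumed for illustration, and the generated code from MyLexer2.jflex will of course look different:

    // An explicit state variable determines which rules are currently active,
    // mirroring the role of YYINITIAL/COMMENT and yybegin() in a JFlex spec.
    public class CommentStateDemo {
        enum State { INITIAL, COMMENT }

        public static void main(String[] args) {
            String input = "a = 1 /* set a */ b = 2";
            State state = State.INITIAL;
            StringBuilder kept = new StringBuilder();
            int i = 0;
            while (i < input.length()) {
                if (state == State.INITIAL) {
                    if (input.startsWith("/*", i)) {
                        state = State.COMMENT;           // like yybegin(COMMENT)
                        i += 2;
                    } else {
                        kept.append(input.charAt(i++));  // normal rules apply here
                    }
                } else {                                 // COMMENT: only comment rules active
                    if (input.startsWith("*/", i)) {
                        state = State.INITIAL;           // like yybegin(YYINITIAL)
                        i += 2;
                    } else {
                        i++;                             // discard comment text
                    }
                }
            }
            System.out.println(kept);                    // prints: a = 1  b = 2
        }
    }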