A Front-End for SimPL
As a demonstration of lexing and parsing, we'll now develop a front end for SimPL using ocamllex and menhir.
Earlier, we already developed an AST type:
type bop =
| Add
| Mult
| Leq
type expr =
| Var of string
| Int of int
| Bool of bool
| Binop of bop * expr * expr
| Let of string * expr * expr
| If of expr * expr * expr
Let's assume that code is in a file named ast.ml
.
Parsing with Menhir
Let's start with parsing, then return to lexing later. We'll assume
all the Menhir code we write below is in a file named parser.mly
.
You can
view the completed file here.
The .mly
extension indicates that this file is intended as input
to Menhir. (The 'y' alludes to yacc.) This file contains
the grammar definition for the language we want to parse.
The syntax of grammar definitions is described by example below.
Be warned that it's maybe a little weird, but that's because
it's based on tools (like yacc) that were developed quite awhile ago.
Menhir will process that file and produce a file named parser.ml
as output; it contains an OCaml program that parses the language.
(There's nothing special about the name parser
here; it's just
descriptive.)
There are four parts to a grammar definition: header, declarations, rules, and trailer.
Header. The header appears between %{
and %}
. It is code that
will be copied literally into the generated parser.ml
. Here we use it
just to open the Ast
module so that, later on in the grammar
definition, we can write expressions like Int i
instead of Ast.Int
i
. If we wanted we could also define some OCaml functions in the
header.
%{
open Ast
%}
Declarations. The declarations section begins by saying what the lexical tokens of the language are. Here are the token declarations for SimPL:
%token <int> INT
%token <string> ID
%token TRUE
%token FALSE
%token LEQ
%token TIMES
%token PLUS
%token LPAREN
%token RPAREN
%token LET
%token EQUALS
%token IN
%token IF
%token THEN
%token ELSE
%token EOF
Each of these is just a descriptive name for the token. Nothing so far
says that LPAREN
really corresponds to (
, for example. We'll take
care of that when we define the lexer.
The EOF
token is a special end-of-file token that the lexer will
return when it comes to the end of the character stream. At that point
we know the complete program has been read.
The tokens that have a <type>
annotation appearing in them are
declaring that they will carry some additional data along with them. In
the case of INT
, that's an OCaml int
. In the case of ID
, that's
an OCaml string
.
After declaring the tokens, we have to provide some additional information
about precedence and associativity. The following declarations say that
PLUS
is left associative, IN
is not associative, and PLUS
has higher precedence than IN
(because PLUS
appears on a line after IN
).
%nonassoc IN
%nonassoc ELSE
%left LEQ
%left PLUS
%left TIMES
Because PLUS
is left associative, 1 + 2 + 3
will parse as (1 + 2) + 3
and
not as 1 + (2 + 3)
. Because PLUS
has higher precedence than IN
, the
expression let x = 1 in x + 2
will parse as let x = 1 in (x + 2)
and not as (let x = 1 in x) + 2
. The other declarations have similar effects.
Getting the precedence and associativity declarations correct is one of the trickier parts of developing a grammar definition. It helps to develop the grammar definition incrementally, adding just a couple tokens (and their associated rules, discussed below) at a time to the language. Menhir will let you know when you've added a token (and rule) for which it is confused about what you intend the precedence and associativity should be. Then you can add declarations and test to make sure you've got them right.
After declaring associativity and precedence, we need to declare what
the starting point is for parsing the language. The following
declaration says to start with a rule (defined below) named prog
.
The declaration also says that parsing a prog
will return an OCaml
value of type Ast.expr
.
%start <Ast.expr> prog
Finally, %%
ends the declarations section.
%%
Rules. The rules section contains production rules that resemble BNF, although where in BNF we would write "::=" these rules simply write ":". The format of a rule is
name:
| production1 { action1 }
| production2 { action2 }
| ...
;
The production is the sequence of symbols that the rule matches. A symbol is either a token or the name of another rule. The action is the OCaml value to return if a match occurs. Each production can bind the value carried by a symbol and use that value in its action. This is perhaps best understood by example, so let's dive in.
The first rule, named prog
, has just a single production. It says
that a prog
is an expr
followed by EOF
.
The first part of the production, e=expr
, says to match an expr
and bind
the resulting value to e
. The action simply says to return that value e
.
prog:
| e = expr; EOF { e }
;
The second and final rule, named expr
, has productions for all the expressions
in SimPL.
expr:
| i = INT { Int i }
| x = ID { Var x }
| TRUE { Bool true }
| FALSE { Bool false }
| e1 = expr; LEQ; e2 = expr { Binop (Leq, e1, e2) }
| e1 = expr; TIMES; e2 = expr { Binop (Mult, e1, e2) }
| e1 = expr; PLUS; e2 = expr { Binop (Add, e1, e2) }
| LET; x = ID; EQUALS; e1 = expr; IN; e2 = expr { Let (x, e1, e2) }
| IF; e1 = expr; THEN; e2 = expr; ELSE; e3 = expr { If (e1, e2, e3) }
| LPAREN; e=expr; RPAREN {e}
;
The first production,
i = INT
, says to match anINT
token, bind the resulting OCamlint
value toi
, and return AST nodeInt i
.The second production,
x = ID
, says to match anID
token, bind the resulting OCamlstring
value tox
, and return AST nodeVar x
.The third and fourth productions match a
TRUE
orFALSE
token and return the corresponding AST node.The fifth, sixth, and seventh productions handle binary operators. For example,
e1 = expr; PLUS; e2 = expr
says to match anexpr
followed by aPLUS
token followed by anotherexpr
. The firstexpr
is bound toe1
and the second toe2
. The AST node returned isBinop (Add, e1, e2)
.The eighth production,
LET; x = ID; EQUALS; e1 = expr; IN; e2 = expr
, says to match aLET
token followed by anID
token followed by anEQUALS
token followed by anexpr
followed by anIN
token followed by anotherexpr
. The string carried by theID
is bound tox
, and the two expressions are bound toe1
ande2
. The AST node returned isLet (x, e1, e2)
.The last production,
LPAREN; e = expr; RPAREN
says to match anLPAREN
token followed by anexpr
followed by anRPAREN
. The expression is bound toe
and returned.
The final production might be surprising, because it was not included in the BNF we wrote for SimPL. That BNF was intended to describe the abstract syntax of the language, so it did not include the concrete details of how expressions can be grouped with parentheses. But the grammer definition we've been writing does have to describe the concrete syntax, including details like parentheses.
There can also be a trailer section after the rules, which like the header
is OCaml code that is copied directly into the output parser.ml
file.
Lexing with Ocamllex
Now let's see how the lexer generator is used. A lot of it will feel familiar from our discussion of the parser generator.
We'll assume all the ocamllex code we write below is in a file named
lexer.mll
.
You can
view the completed file here.
The .mll
extension indicates that this file is intended
as input to ocamllex. (The 'l' alludes to lexing.) This file contains
the lexer definition for the language we want to lex.
Menhir will process that file and produce a file named lexer.ml
as output; it contains an OCaml program that lexes the language.
(There's nothing special about the name lexer
here; it's just
descriptive.)
There are four parts to a lexer definition: header, identifiers, rules, and trailer.
Header.
The header appears between {
and }
. It is code
that will simply be copied literally into the generated lexer.ml
.
{
open Parser
}
Here, we've opened the Parser
module, which is the code in parser.ml
that was produced by Menhir out of parser.mly
. The reason we open it
is so that we can use the token names declared in it, e.g., TRUE
,
LET
, and INT
, inside our lexer definition. Otherwise, we'd
have to write Parser.TRUE
, etc.
Identifiers. The next section of the lexer definition contains identifiers, which are named regular expressions. These will be used in the rules section, next.
Here are the identifiers we'll use with SimPL:
let white = [' ' '\t']+
let digit = ['0'-'9']
let int = '-'? digit+
let letter = ['a'-'z' 'A'-'Z']
let id = letter+
The regular expressions above are for whitespace (spaces and tabs), digits (0 through 9), integers (nonempty sequences of digits, optionally preceded by a minus sign), letters (a through z, and A through Z), and SimPL variable names (nonempty sequences of letters) aka ids or "identifiers"—though we're now using that word in two different senses.
FYI, these aren't exactly the same as the OCaml definitions of integers and identifiers.
The identifiers section actually isn't required; instead of writing
white
in the rules we could just directly write the regular expression
for it. But the identifiers help make the lexer definition more
self-documenting.
Rules. The rules section of a lexer definition is written in a notation that also resembles BNF. A rule has the form
rule name =
parse
| regexp1 { action1 }
| regexp2 { action2 }
| ...
Here, rule
and parse
are keywords. The lexer that is generated will
attempt to match against regular expressions in the order they are
listed. When a regular expression matches, the lexer produces the token
specified by its action
.
Here is the (only) rule for the SimPL lexer:
rule read =
parse
| white { read lexbuf }
| "true" { TRUE }
| "false" { FALSE }
| "<=" { LEQ }
| "*" { TIMES }
| "+" { PLUS }
| "(" { LPAREN }
| ")" { RPAREN }
| "let" { LET }
| "=" { EQUALS }
| "in" { IN }
| "if" { IF }
| "then" { THEN }
| "else" { ELSE }
| id { ID (Lexing.lexeme lexbuf) }
| int { INT (int_of_string (Lexing.lexeme lexbuf)) }
| eof { EOF }
Most of the regular expressions and actions are self-explanatory, but a couple are not:
The first,
white { read lexbuf }
, means that if whitespace is matched, instead of returning a token the lexer should just call theread
rule again and return whatever token results. In other words, whitespace will be skipped.The two for ids and ints use the expression
Lexing.lexeme lexbuf
. This calls a functionlexeme
defined in theLexing
module, and returns the string that matched the regular expression. For example, in theid
rule, it would return the sequence of upper and lower case letters that form the variable name.The
eof
regular expression is a special one that matches the end of the file (or string) being lexed.
Note that it's important that the id
regular expression occur nearly
last in the list. Otherwise, keywords like true
and if
would be lexed
as variable names rather than the TRUE
and IF
tokens.
Generating the Parser and Lexer
Now that we have completed parser and lexer definitions in parser.mly
and lexer.mll
, we can run Menhir and ocamllex to generate the
parser and lexer from them. Ocamlbuild already knows how to run those
tools. For example:
$ ocamlbuild lexer.ml
will generate _build/lexer.ml
from lexer.mll
, and
$ ocamlbuild -use-menhir parser.ml
will generate _build/parser.ml
from parser.mly
.
The flag -use-menhir
alerts ocamlbuild to use Menhir instead of ocamlyacc
for files with the .mly
extension. You can instead put that into your _tags
file:
true: use_menhir
The Driver
Finally, we can pull together the lexer and parser to transform a string into
an AST. Put this code into a file, e.g., main.ml
:
open Ast
let parse (s : string) : expr =
let lexbuf = Lexing.from_string s in
let ast = Parser.prog Lexer.read lexbuf in
ast
This function takes a string s
and uses the standard library's
Lexing
module to create a lexer buffer from it. Think of that buffer
as the token stream. The function then lexes and parses the string into
an AST, using Lexer.read
and Parser.prog
. The function Lexer.read
corresponds to the rule named read
in our lexer definition, and the
function Parser.prog
to the rule named prog
in our parser
definition.
Note how this code runs the lexer on a string; there is a corresponding
function from_channel
to read from a file.
We could now use parse
interactively to parse some strings. For example:
# parse "let x = 3110 in x + x";;
- : expr = Let ("x", Int 3110, Binop (Add, Var "x", Var "x"))
That completes lexing and parsing for SimPL!