Modern languages have module mechanisms to support modular programming. For example, Java has classes and OCaml has modules. Modules promote loose coupling between different parts of the program, by hiding internal implementation details and by restricting access to internal mechanisms. They also support separate compilation, in which the program comprises separately compiled components—compilation units—that are later linked together to form a full program.
Module mechanisms complicate type checking because they introduce multiple namespaces that must be used in the right way to resolve names of variables, functions and types. Modules can also be parameterized on type variables, can introduce abstract type variables, and can be recursive. These features make type checking even trickier.
Modules define a set of named resources that can be used by other components of the program. The resources exported by the module are the module's interface. Resources may include variables, constants, functions, types, and even nested modules. For each module used by the current code, the compiler needs a way to check its interface for named resources and to generate code using them. Typically, this is done by creating a symbol table for each module.
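To make this concrete, here is a minimal sketch of a per-module symbol table, assuming nothing more than a map from member names to types; ModuleTable and the Type stand-in are hypothetical names invented for this sketch:

import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the compiler's representation of types.
interface Type {}

// Hypothetical sketch: a symbol table for one module, mapping
// exported member names to their types.
class ModuleTable {
    private final Map<String, Type> members = new HashMap<>();

    void declare(String name, Type type) {
        if (members.containsKey(name))
            throw new IllegalStateException("duplicate declaration: " + name);
        members.put(name, type);
    }

    // Returns null if the module declares no member with this name.
    Type lookup(String name) {
        return members.get(name);
    }
}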
When compiling a module, the compiler must also know about names that are defined internally to the module but are not accessible outside (e.g., Java's “private” fields and methods). Therefore, within the module, we have an internal environment that is larger than the one that is exported.
In many languages, such as OCaml or C, a module's interface is declared separately from its implementation, so there are really two separate environments. When compiling a module, the compiler must check that these two environments conform: they must agree on the type and kind of every name that appears in the interface, so that external code using the module can access its resources correctly.
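A conformance check then amounts to: every name in the interface environment must appear in the implementation environment with an agreeing type. A minimal sketch, reusing the hypothetical Type stand-in from the sketch above and assuming Type has a meaningful equals:

import java.util.Map;

class Conformance {
    // Hypothetical sketch: check that an implementation environment
    // provides everything its declared interface promises.
    static void check(Map<String, Type> iface, Map<String, Type> impl) {
        for (Map.Entry<String, Type> entry : iface.entrySet()) {
            Type provided = impl.get(entry.getKey());
            if (provided == null)
                throw new Error("missing definition for " + entry.getKey());
            if (!provided.equals(entry.getValue()))
                throw new Error("type mismatch for " + entry.getKey());
        }
        // Names in impl but absent from iface are fine: they are
        // internal to the module and simply not exported.
    }
}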
Figure 1: separately compiling Java classes
Sometimes conformance is deferred until link time or run time, as shown in Figure 1. In the figure, the file A.java, containing the classes A and B, is compiled against the interface of the class C. The interface of C is obtained automatically from the class file for C, where the compiler puts it. (In Java, if the class file is missing, the interface might also be obtained from the source file for the class, if available, and in this case the compiler will compile the additional source file automatically.) In other languages, the interface is often a separate source file.
During compilation, the compiler builds a top-level environment that maps globally visible names such as A, B, and C to compile-time representations of the corresponding modules (classes). Typically these compile-time representations would be symbol tables describing the names defined by the classes: instance variables and methods. These symbol tables are environments mapping names of class members to the corresponding types (and other information). For example, there is an A environment mapping the names x and f to representations of their types. Similarly, the B environment maps the name y to its type, which is C, for which there is an environment that can be used to resolve the type of an expression like y.g. The module representation also contains a reference to an AST for the class declaration.
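The structure just described might be realized as in the following sketch; ClassSymbol, TopLevelEnv, and ClassDecl are hypothetical names, and Type is the stand-in from the earlier sketch:

import java.util.HashMap;
import java.util.Map;

class ClassDecl { String name; } // stand-in for a class-declaration AST node

// Hypothetical compile-time representation of a class (module):
// an environment of members plus a reference to the class's AST.
class ClassSymbol {
    final Map<String, Type> members = new HashMap<>(); // fields and methods
    ClassDecl ast; // null if we only have a class file, not source
}

// The top-level environment maps globally visible class names
// to their compile-time representations.
class TopLevelEnv {
    final Map<String, ClassSymbol> classes = new HashMap<>();
}

With this structure, type-checking y.g first looks up y in the members of the B symbol to find the class type C, and then looks up g in the ClassSymbol for C.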
At run time in Java, there is no guarantee that the actual code of class C used at run time, obtained from the classpath, implements the interface that was used at compile time. For type safety, a conformance check must be done. Therefore, the class file for class A contains type signatures for the parts of C that are used; these signatures are checked when methods of C are used, to make sure that they agree with the class file for C that is loaded at run time. In the C language, by contrast, source files are compiled against declarations of functions from other compilation units but there is no run-time check that the implementation used at run time matches the signature that was used to compile calls to it.
Modules can hide types completely. For example, in OCaml we can write a module signature M that announces the existence of a type t without revealing what the type is:
module type M = sig
  type t
  val f : t -> t
end
Here it is necessary to introduce a binding in the environment for M that says t is a type, without identifying which type it is.
Since the compiler doesn't know which type t is, it doesn't know how much storage is required to implement t. Unless the linker can later patch code that uses t, the compiler must generate code that works regardless of the size of t. This flexibility is usually achieved through an indirection, relying on the usually valid assumption that pointers have the same size no matter what they point to. As far as client code is concerned, the type t is a word-sized pointer, and the client code doesn't have to know how big the chunk of memory is that the pointer points to.
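Inside the compiler, this can be modeled by giving every type a size, with abstract types reporting one word no matter what they hide; a hypothetical sketch:

// Hypothetical sketch: sizes of types as the compiler sees them.
interface Ty {
    int sizeInBytes();
}

class IntTy implements Ty {
    public int sizeInBytes() { return 4; }
}

// An abstract type such as OCaml's "type t": clients manipulate
// only a pointer to the hidden representation, so from their
// point of view the size is always one word.
class OpaqueTy implements Ty {
    static final int WORD_SIZE = 8; // assuming a 64-bit target
    public int sizeInBytes() { return WORD_SIZE; }
}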
At this point we have two conflicting notions of how to represent types in the compiler. One is to follow the type grammar and represent the type as an AST. The second is to construct an object (such as a symbol table) that describes the type and supports more efficient operations such as looking up signatures. For classes in an object-oriented language, which are modules but are also defined by an AST, the choice is particularly difficult.
The strategy of using AST nodes to represent types is feasible and perhaps the simplest approach, but it does have some pitfalls. Consider the following Java-like example with nested classes:
class A extends B {
    int x;
    D y;
    class C {
        int z;
        ... y.w ...
    }
}
class D { C w; }
class C { Object z; }
The type of the term y.w is C, but it is the top-level C. However, the name C within the context where the term y.w occurs refers to the nested class C. If we are not careful about how we represent and resolve types, the compiler will think that y.w.z is an int rather than an Object, allowing run-time type errors.
In general, the AST strategy for representing types makes it necessary to represent a type not as just an AST for the type expression, but as a pair that includes both the type AST and a symbol table that should be used to resolve any identifiers that are found within the type. For example, we can represent the type of y.w as the type expression C, along with the typing context (symbol table) for the term y.w.
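In code, such a pair might look like the following sketch, where TypeNode and SymbolTable are hypothetical stand-ins:

class TypeNode {}    // stand-in for a type-expression AST node
class SymbolTable {} // stand-in for an environment used for name resolution

// Hypothetical sketch: a type represented as the AST of the type
// expression paired with the symbol table in which its names
// (such as C) must be resolved.
class TypeRef {
    final TypeNode ast;
    final SymbolTable scope;
    TypeRef(TypeNode ast, SymbolTable scope) {
        this.ast = ast;
        this.scope = scope;
    }
}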
Another strategy for representing types in the compiler is to interpret types at compile time as type objects. With this strategy, the type of the field w would be interpreted in the context of class D, and correctly resolved to the outer class C. The type object for this class would then have an attached environment for looking up methods and instance variables, as suggested in Figure 1.
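A sketch of the type-object alternative, with hypothetical names: the field w's type is resolved once, in D's context, to a direct reference to the outer C's type object, so no later lookup by name can go wrong:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a fully resolved type object for a class,
// carrying its own environment of members.
class ClassType {
    final String name;
    final Map<String, ClassType> members = new HashMap<>();
    ClassType(String name) { this.name = name; }
}

class Demo {
    static void build() {
        ClassType outerC = new ClassType("C");
        ClassType classD = new ClassType("D");
        // Resolving the declaration "C w" in class D binds w directly
        // to the outer C's type object, not to the name "C".
        classD.members.put("w", outerC);
    }
}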
Interpreting type objects creates challenges for dealing with recursive modules, because it suggests the compiler may need to resolve class names to class objects before the class objects themselves are created. For example, class A might refer to class B, so while resolving type declarations in A, there must already be some kind of class object for B, and vice versa.
To break the circularity, type checking can be done in multiple passes over the AST (a sketch of the passes follows the list):
First pass: Create type objects. Find all classes referenced in the source code and create empty class objects for all of them.
Second pass: Construct method signatures. Walk over all methods or functions and construct their types. This handles recursion among methods.
Third pass: Type-check the code, using the method signatures and type objects created in previous passes.
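Here is a minimal sketch of the three passes, reusing the hypothetical ClassDecl, ClassSymbol, and TopLevelEnv from the earlier sketches; buildSignatures and checkBodies are invented helper names:

import java.util.List;

class ThreePass {
    static void typeCheckProgram(List<ClassDecl> classes, TopLevelEnv env) {
        // Pass 1: create an empty type object for every class, so that
        // even mutually recursive class references can be resolved.
        for (ClassDecl c : classes)
            env.classes.put(c.name, new ClassSymbol());

        // Pass 2: every class name now resolves, so fill in field and
        // method signatures, which may mention any class.
        for (ClassDecl c : classes)
            buildSignatures(c, env);

        // Pass 3: with all signatures known, check the method bodies.
        for (ClassDecl c : classes)
            checkBodies(c, env);
    }

    static void buildSignatures(ClassDecl c, TopLevelEnv env) { /* walk declarations */ }
    static void checkBodies(ClassDecl c, TopLevelEnv env) { /* check statements */ }
}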
An important choice for compiler design is whether the AST will be mutable or immutable. In the more classic approach, a mutable AST is a single data structure that is decorated by various compiler passes with additional information such as types. This approach is efficient because the same data structure is reused by multiple passes, but as with any other complex, mutable data structure shared by multiple clients, programming is error-prone.
In the immutable AST approach, no AST is changed after it is created. It is the job of each compiler pass to create a new AST that represents the code transformation performed by that pass. Because the passes are more independent, this approach is less error-prone. Copying entire ASTs is not very efficient, but if the AST is truly immutable, some efficiency can be regained by allowing different ASTs to share the same underlying nodes.
We've seen that type checking is conveniently implemented as a recursive traversal of the AST. But it is not the only recursive AST traversal; in general there are multiple passes over the AST. In addition to the ones just discussed, we might also traverse the AST to do optimizations like constant folding, to perform other semantic checks such as detecting uses of uninitialized variables, and to generate code. After writing a couple of such traversals, it is common to discover that a lot of code is repeated across the traversals: so-called boilerplate code that contains the skeleton of a traversal. How this boilerplate looks depends on the programming style: object-oriented (OO) or functional.
In the object-oriented style, a given compiler pass over the AST is represented as a recursive method implementing the behavior of that pass for each particular class, specifying how to traverse that node type and perform whatever work is needed by that particular compiler pass. For example, type checking may be implemented as a method typeCheck(...) that specifies how each node is type-checked. The object-oriented style puts all the code of a given node in one place: its class. This structure has the advantage that it is easy to add new node types to the compiler. On the other hand, there is replication of traversal boilerplate code across the different passes implemented in each class. Another weakness of the object-oriented style is that adding a new pass requires, in general, changing many classes.
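For instance, an OO-style type-checking pass might be spread across the node classes as in this hypothetical sketch (with a simple BaseType enum standing in for real types):

// Hypothetical OO-style sketch: each node class carries its own
// typeCheck method, including the traversal of its children.
enum BaseType { INT, BOOL }

abstract class ExprNode {
    BaseType type;
    abstract void typeCheck();
}

class NumNode extends ExprNode {
    int value;
    void typeCheck() { type = BaseType.INT; }
}

class PlusNode extends ExprNode {
    ExprNode e1, e2;
    void typeCheck() {
        e1.typeCheck(); // traversal boilerplate: repeated in every
        e2.typeCheck(); // pass that is written in this style
        if (e1.type != BaseType.INT || e2.type != BaseType.INT)
            throw new Error("operands of + must be int");
        type = BaseType.INT;
    }
}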
In the functional style of traversal, a compiler pass instead becomes a recursive function that implements that pass for all node types. It is easy to add new passes to the compiler because no change is needed to previously implemented passes. On the other hand, there tends to be significant repetition of boilerplate traversal code among the various passes. And adding new node types to the compiler is not a local change: each of the passes must be updated to handle the new node type.
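In Java, the functional style can be approximated with a pattern-matching switch (Java 21); the whole pass is one recursive function with a case per node type, reusing the hypothetical node classes from the previous sketch:

class FunctionalPass {
    // Hypothetical functional-style sketch: one function implements
    // the entire pass; every node type gets a case here.
    static BaseType typeCheck(ExprNode e) {
        return switch (e) {
            case NumNode n -> BaseType.INT;
            case PlusNode p -> {
                if (typeCheck(p.e1) != BaseType.INT
                        || typeCheck(p.e2) != BaseType.INT)
                    throw new Error("operands of + must be int");
                yield BaseType.INT;
            }
            // A new node type forces a new case in this and every
            // other pass written in this style.
            default -> throw new Error("unhandled node type");
        };
    }
}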
The OO style naturally keeps all the code associated with a given AST type together in one place; the functional style naturally keeps all the code associated with a given pass together in one place. We can visualize the compiler code as forming a table, in which the rows are AST node types and the columns are different compiler passes. In the OO style, the code is organized around rows of the table, so it is easy to add a new row in a modular way. In the functional style, the code is organized around columns, so it is easy to add a new column. Either style comes with tradeoffs.
Figure 2: OO style vs. functional style
The tension between organizing code around rows and columns, while being able to extend either horizontally or vertically in a modular way, is usually called "The Expression Problem". Various programming languages have introduced mechanisms that address this problem, such as multimethods, polymorphic variants (OCaml), and family polymorphism. However, these are all fairly exotic mechanisms that are probably unavailable in the language you are using.
The Visitor pattern is a design pattern that supports functional-style traversal in an object-oriented language, while reducing boilerplate. The key idea is to keep all of the logic and state of a (compiler) traversal in a Visitor object, while having the individual node object supply the boilerplate traversal code. There is a base Visitor class with stub methods for each of the node types, something like the following:
class Visitor {
    void visitPlus(Plus node) {}
    void visitIf(If node) {}
    void visitNum(Num node) {}
    void visitNot(Not node) {}
    ...
}
A particular compiler traversal is implemented as a subclass of Visitor, providing the logic to be applied to each node. If some node type requires special logic, it is provided by the subclass by overriding that node type's method. For example, a class for type checking would override the visit methods to provide code implementing the typing rules.
class TypeChecker extends Visitor {
    SymbolTable env;
    void visitPlus(Plus p) {
        if (p.e1.type.equals(intType) && p.e2.type.equals(intType)) {
            p.type = intType;
        }
        // otherwise: report a type error
    }
    ...
}
The boilerplate traversal code is provided by the accept method of the node classes:
class Expression {
    Type type;
}
class Plus extends Expression {
    Expression e1, e2;
    void accept(Visitor v) {
        e1.accept(v);        // visit children first, so their types
        e2.accept(v);        // are available when this node is visited
        v.visitPlus(this);   // then apply the visitor's logic to this node
    }
}
The Polyglot visitor pattern is used in the Polyglot compiler framework to support traversals of immutable ASTs, and supports building new ASTs as traversal output while sharing nodes with the original AST. The accept method is modified to return a node. It also supports modifying the visitor object in a functional way for use when traversing the children of the node. The boilerplate traversal code is provided by a method visitChildren on Node; its job is to traverse the children and build a new node if necessary.
class Node {
    Node accept(Visitor v) {
        Visitor v2 = v.enter(this);
        Node n = visitChildren(v2);
        return v2.leave(n, this);
    }
}
class TypeChecker extends Visitor {
    Node leave(Node new_node, Node orig) {
        return new_node.typeCheck(this);
    }
}
class Plus extends Expression {
    Expression visitChildren(Visitor v) {
        Expression new_e1 = (Expression) e1.accept(v);
        Expression new_e2 = (Expression) e2.accept(v);
        if (new_e1 != e1 || new_e2 != e2) {
            return new Plus(new_e1, new_e2);
        } else {
            return this; // no new node needed
        }
    }
    Node typeCheck(TypeChecker tc) {
        // Type-checking logic. May build a new node
        // or imperatively modify this one.
        return this;
    }
}
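Under this scheme, running a pass is just an accept call at the root; a hypothetical usage (Parser.parse is an invented entry point):

// Hypothetical usage of a Polyglot-style pass.
Node root = Parser.parse(sourceFile);
Node checked = root.accept(new TypeChecker());
// If the pass rebuilt nothing, checked == root; otherwise the result
// still shares every subtree that the pass left unchanged.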