Programming Assignment 3

CS 412/413
Introduction to Compilers
Cornell University Computer Science Department, Spring 2001

Programming Assignment 3: Intermediate Code Generation

due Monday, March 12, in class

Updates

[2/23] Fixed runtime library routine names to include leading underscore.

In this programming assignment, you will build an intermediate code generator for the language Iota, defined in http://www.cs.cornell.edu/cs412/2001sp/iota/iota.html. We expect you to build upon the code that you wrote for Programming Assignments 1 and 2, and to fix any problems with your lexical analysis, parsing, or semantic analysis that interfere with IR generation for correct Iota programs.

Requirements

Your program will have a main class, Iota.IRGen, that translates Iota code to an intermediate representation. When invoked from the command line, the translator will implement at least the following minimal specification: it should expect to receive a single argument that is the name of a source file containing a module specification, with an extension of ".im". The translator will parse and type-check the program as in Programming Assignments 1 and 2. It will next convert the AST for each function into an intermediate representation (IR) and then put the IR into canonical form.

Your translator must support a number of dump options for debugging purposes:

It must continue to support the "-dump_ast" option from Programming Assignment 2.
The "-dump_cir" option prints to System.out a textual representation of the canonical IR of each of the functions in the module.
The "-dump_ir" option prints to System.out a textual representation of the (non-canonical) IR of each of the functions in the module.

You may choose to add additional options not mentioned here as long as they do not interfere with the required functionality. For example, your IR generator might accept more than one source file on the command line or support more debugging options than those listed. You may allow multiple dump options to be specified; we require support for only a single dump option on a particular command invocation.

Extra credit

Certain language features need not be implemented in this assignment. However, you will be required to implement them for a later programming assignment, and you may do them now for extra credit. If these language features are encountered in the program, the compiler should print a message: "<filename>:<lineno>:Language feature not implemented: <description>" and then exit. We strongly recommend that you get the rest of the IR generation working before tackling the extra credit items.

Logic Operators. Iota has the short-circuiting logical operators & and |. You are not required to implement these operations.

Bounds checks. The Iota language specification requires that the indices used to access elements of arrays and strings be bounds-checked. For this programming assignment, we do not require implementation of bounds checking.

Arithmetic checks. The Iota specification says that division by zero is caught and terminates the program. You do not need to implement this yet.

IR model

We expect that you will use a tree-based IR similar to the one described in class or the slightly different IR described by Appel. You may choose to add more IR nodes or modify some of the IR nodes; if you do so, you must describe the meaning of these new IR nodes and explain why you added them. We recommend that before coding you carefully design and document all of the IR translation functions and the transformations that you use to convert the IR to canonical form. Canonical form for the IR must have the properties described in lecture. In other words, all side-effects, including function calls, must be moved to the top level as statements. All SEQ and ESEQ nodes must be removed as part of this process. Also, two-way conditional jumps must be converted to one-way jumps by basic block reordering and the possible introduction of new jump statements.

Suggestions for how to proceed

You are free to implement this project in any order you see fit. However, we suggest that the first step is to carefully design your IR tree nodes and write the interfaces defining them. This will allow the IR translation and flattening phases to be attacked independently.

For IR translation, define a method on AST nodes that creates a corresponding IR node, as discussed in lecture. Write a placeholder method inherited by all AST nodes that generates an empty IR node or, even better, a special IR node that can contain the entire AST node that was to be transformed. When dumped, the special IR node uses your existing AST dumping facility. This placeholder method can then be implemented for each of the Iota language constructs in turn, so that you can test IR translation on simple programs and gradually add the functionality for translating all of the AST nodes to IR.

For tree flattening, you can take the same approach. First, attack the problem of bringing side-effecting nodes up to the top level. Elimination of two-way jumps can be worked on as a parallel problem. For the hoisting of side-effecting nodes, define a transformation method for each IR node that reduces it to a sequence of statement nodes plus an expression node (when flattening expressions, as described in class.) Again, you can use a placeholder method to allow incremental development and testing. The placeholder transformation method simply returns the statement node or expression node it is invoked on. This approach will allow you to test out your tree-flattening functionality without having implemented all of the relevant code.

Output format

With the appropriate options, your compiler will print a representation of the IR for each of the functions in the source program. Each function's IR should be preceded by a line "Function <name>:". As in Programming Assignment 2, the printed IR representation should use parentheses to indicate tree structure. The output will be simpler because there is no need to print the types of the various nodes. Binary operation nodes may be indicated using the corresponding symbol. The non-canonical IR for each function will be a single item in the print-out; the canonical IR will be a list of parenthesized items, one for each statement in the canonical IR. For example, the Iota function

    f(x: int) = x + g(x)

might be compiled to IR and printed out as follows:

    Function f:
    MOVE(RV, MEM(FP + 8) + CALL(NAME(g), MEM(FP + 8)))

For compactness, + nodes in the trees are printed out in infix notation here; CONST nodes are printed simply as the corresponding constant, and FP and RV denote the frame pointer and return value registers, respectively. In canonical form, CALL nodes must be hoisted to the top of the tree, resulting in output like the following:

    Function f:
    MOVE(TEMP(t1), CALL(NAME(g), MEM(FP + 8)))
    MOVE(RV, MEM(FP + 8) + TEMP(t1))

Pretty-printing the IR output

To help you generate reasonable dumps of these IR trees, you should continue to use the CodeWriter class. Your IR nodes must implement the Unparse interface from Programming Assignment 2:

    interface Unparse { 
        public void unparse(Iota.util.CodeWriter cw); 
            // Write a human-readable representation of the node to the given 
            // CodeWriter cw. 
    }

Do not use the newline() method to insert newlines except perhaps at the end of each statement in the canonical IR. As an example, the ESEQ node described in Appel might be dumped with the following code:

    class ExpSeqNode extends IRNode implements Unparse { 
        StmNode stm;
        ExpNode exp;

        public void unparse(CodeWriter cw) { 
            cw.write("ESEQ("); 
            cw.begin(0); 
    
            stm.unparse();

            cw.write(","); 
            cw.allowBreak(0); 

            exp.unparse();

            cw.write(")"); 
            cw.end(); 
        } 
    }

Implementing Iota

Word size

In the next assignment, you will be generating 32-bit Pentium code, so all variables and registers will take a full 32 bits of storage. Although it would be possible to generate code that stores characters and booleans more compactly, these data types will be stored in 32-bit words as well. All stack locations and dynamically allocated objects will be aligned to 32-bit boundaries: the address of a variable or dynamically allocated object always will be divisible by four.

Iota modules and name mangling

An Iota program may comprise several Iota modules that must be linked together. When you generate code for an Iota module, you must convert the names used for publicly visible symbols in the Iota program into a globally unique form, so that the linker can link external references from other modules to these global variables or functions. The public variables and functions in a module are those declared in the interface file of the module; these are the only names that other modules can talk about. Because the linker only understands names that are simple strings, it is necessary to encode Iota's qualified names as simple identifiers. This process of name conversion is known as name mangling.

Global variables and function names in Iota are translated to assembler names as follows. A global identifier I in module M is translated as an assembler symbol _M__I; that is, two underscore characters are placed after the name of the module, and one before. For example, if the Iota program expression contains a call like F(...), an invocation of a function F in another module M, the following IR might be generated for the call:

       CALL(NAME(_M__F), ...)

These fully-expanded names are used even when the global function or variable is found within the same module as the call itself. This name mangling scheme is chosen for simplicity. Obviously you can create confusion by writing Iota programs in which double underscores appear in the identifiers. A better mangling scheme would avoid this problem but be more complex to implement. Note that the actual name that is used by the referencing module is not used for mangling; the important name is the name that the defining module uses. This distinction is important for handling uses clauses of the form a=b.c. Where the identifier a is used, the IR should contain the mangled name _b__c instead.

The only exception to this name-mangling strategy is the function main, which is the first function executed in the program. To allow the run-time system to find this function, its name must be mangled as _iota__main regardless of what module it occurs in.

Iota runtime

Various Iota language constructs require runtime support. For the next assignment in which you will generate assembly code, a library of runtime support routines will be provided. These functions are available to the assembler code you will generate but are not exposed to the Iota programmer. Your IR should make use of these functions as appropriate, invoking them with the CALL node.

_iota__newarray. Takes two arguments, the first argument is the desired number of elements in the array, the second is a value to which all elements will be initialized. The return value is a pointer to the allocated storage. Be careful! Recall that in Iota the expression used to initialize arrays is evaluated once for each element.
_iota__abort. No arguments. Causes your program to crash.
_iota__strconcat. Takes two arguments, which are both pointers to Iota strings. Concatenates the two strings to a new location in memory. Returns a pointer to the concatenated string. This is the only routine you should need to create strings, because you should put string constants in the data segment.
_iota__strcompare. Takes two arguments, which are both pointers to Iota strings. Compares the strings lexicographically. The return value is an integer: -1 if s0 < s1, 0 if s0 == s1, 1 if s0 > s1.

You will also have the ability to define your own runtime support routines in the next assignment. You may define such routines during this assignment and use them in your IR. If you do so, you must submit the C source code that implements your routines with this assignment. You should be aware that a call to a runtime routine will likely be more expensive than the IR code that it replaces, even if that IR code is fairly complicated. Be sparing with the number of routines you introduce, because frequent calls to runtime routines will likely degrade performance of the code your compiler emits. As another consequence, use of such routines in the implementation of one of the extra credit items above will disqualify you from receiving that particular extra credit.

Dynamically allocated objects

Arrays and strings will be stored in memory in the manner described in class. To create new arrays and strings, you will use the _iota__newstring and _iota__newarray routines described above. Note that these routines allocate the correct amount of space to store the array or string, including its length. The values returned by these routines actually are pointers to the 32-bit word following the length; that is, to the array elements or string characters themselves. This is the address you should store (whether on the stack or heap) for the array or string variable.

Strings are stored in a packed representation in which every character is placed in a consecutive byte. The string is always terminated by an extra null character (0), and the string is also padded out with nulls until it fills out a 32-bit word completely. Here are some examples of string layouts in memory.

An array is laid out similarly in memory, although each element of the array takes up a full 32-bit word and there is no terminator:

The Iota language specification requires that accesses to elements of arrays and strings be bounds-checked, although it is not required for this assignment. To implement such a feature, it is necessary to have a way to halt the program in mid-execution. The procedure _iota__abort can be used to accomplish this. Code that invokes _iota__abort should print some useful message first.

Return statement

The Iota return statement can be converted to IR in a number of ways. We recommend that a return statement be converted to a sequence of two IR statements. The first statement moves the returned expression (if any) into the return-value register (RV), and the second statement jumps to the label of the current function epilogue. This strategy allows the function epilogue to be shared by all possible return paths from the function. Of course, the epilogue must have a label for this to work. Creation of the bodies of the function prologue and epilogue can be postponed until code generation. Consequently, the function epilogue in the IR is empty except for a single labeled statement; we would like you to give it the label F_epilogue where F is the name of the function in question. The function prologue does not need to be represented in the IR at all; it will be generated in the next compiler phase.

What to turn in

Turn in on paper:

A brief document (max 3 pages) describing your IR generator and its structure, and your testing strategy. This document should be clear, concise, and convincing. Justify any changes made to the problem set specifications.
A design document (max 5 pages) describing the translations and transformations your program implements. This document does not need to be a complete description, but it should describe any interesting translation or transformation problems you encountered, and how you resolved them. The use of the notations introduced in class for describing code transformations is encouraged but not required.
As always, we are interested in hearing your overall thoughts about the assignment: how much time you spent on it, what was the most difficult part, what was the most valuable or interesting part, and how you think it could be made better.

Turn in electronically:

Source code
Compiled .class files properly located in their directory hierarchy
javadoc output that we can browse on-line. We should be able to browse through all your classes and interfaces, including all their methods and fields.
The test cases you mentioned in the paper document.

Electronic submission instructions

Your electronic submission is expected at the same time as your written submission: at the beginning of class on the due date. Electronic submissions after 10AM will be considered a day late. To submit Programming Assignment 3, please drop your files in \\goose.csuglab.cornell.edu\courses\cs412-sp01\grpX\pa3, where grpX is your group identifier. Please organize your top-level directory structure as follows :

src\ - all of your source code and class files. For example, we expect to find a class file for your main program in src\Iota\IRGen.class, and a companion .java file in the same directory.
doc\ - documentation, including your write-up and a README.TXT containing information on how to compile and run your project, a description of the class hierarchy in your src\ directory, brief descriptions of the major classes, any known bugs, and any other information that we might find useful when grading your assignment.
test\ - any test cases you used in testing your project.

Warning: Failure to submit your assignment in the proper format will result in deductions from your grade.

CS 412/413 Introduction to Compilers Cornell University Computer Science Department, Spring 2001