How to pick an implementation language for CS 4120
This document has some guidelines and tips for choosing a language to write your compiler in for CS 4120. It's based on the experiences of past groups.
General guidelines
- Pick a language everyone in your group knows. Everyone should have written at least a few non-toy programs in it. Simultaneously learning a language while using it to write the compiler will be very challenging.
- Default to Java. The course staff and infrastructure are best equipped to support compilers written in it.
- Get the approval of the course staff before using something other than Java or OCaml. The staff have experience dealing with compilers written in various languages. They can tell you about pitfalls that you would rather not discover for yourself during the later programming assignments.
Golden rule for selecting a language
Pick a language that has an expressive type system, is statically typed, and is garbage-collected.
- An expressive type system means that types tell you a lot about the values they apply to. Modern Java has an expressive type system;
ArrayList<T>
definitely contains objects of type T. By contrast, early Java only had theArrayList
(with no generic parameter), so programmers only knew it had objects and thus frequently had to cast each element from Object to something useful. - In statically-typed languages, all variables have known types at compile time and all operations are type-checked by the compiler.
- In garbage-collected languages, the language runtime handles memory management for you. C and C++ are not garbage-collected, so you are allowed to write programs that contain memory leaks and uses of invalid memory locations. In a program as complex as a compiler, debugging these will be a waste of time. (Side-note: Rust has neither garbage collection nor manual memory management, and is discussed below.)
Notes on specific languages
- Java is the most common language used by CS 4120 groups. It has a large standard library and an expressive yet practical type system.
- OCaml is the second most common language. It has pattern matching, which is useful for the first few programming assignments. It is a functional programming language yet allows operations to be sequenced using semicolons. This fact comes in handy later in the course, when many useful algorithms are specified imperatively.
- Haskell has also been used by a few groups. A compiler in Haskell is a complex project that may require advanced features of Haskell, such as GADTs, type families, monad stacks, lenses,
-XRankNTypes
, and-XTemplateHaskell
. Familiarity with at least a few, and definitely with the build system (cabal or stack) you plan to use, is essential. - Python, although it has impressive prototyping speed, will have issues scaling to a project as large as a compiler. Execution speed may also become an issue.
- C and C++ have manual memory management and contain plenty of opportunities for programmers without very strong experience to make time-consuming errors.
- Rust has been used occasionally, and has modern language features and a very expressive type system. Strong familiarity with writing programs that pass both the type checker and the borrow checker is essential. Although lack of libraries is a common criticism of Rust, most libraries that you'll need for writing a compiler (for example, parser generators) already exist.
Again, remember that picking a language everyone is already familiar with is strongly recommended; the projects rapidly get more difficult and you probably don't want to be figuring out the documentation days before a deadline.