CS410P/510 Programming Language Compilation

What is this course?

- Hands-on introduction to compiling high-level languages to machine code,
  using a "toy compiler" approach.
- Fairly generic statically-typed source languages, including arithmetic,
  booleans, expressions, statements, functions, records.
  (Not too much about FP- or OO-specific issues.)
  (Maybe just a bit about dynamically-typed languages.)
- Target: x86-64 assembly code (then turned into machine code) and
  a runtime system (coded in C).
- As time permits, also look briefly at high-performance interpretation
  strategies.

Topics:

- Compiler structure
- Program representations
- Machine code generation
- Register allocation
- Garbage collection
- Dataflow analysis
- Optimization
- Efficient interpretation
- Correctness (and Verification)

Plenty to do, but many important things will be omitted, e.g.

- Front ends (parsing, error messages, etc.)
- Paradigm-specific issues (first-class functions, objects, ...)
- Concurrency
- Targeting specialized (heterogeneous) hardware
- JIT compilation
- etc.

Why take this course?

- Learn what's really going on inside a crucial part of computing systems.
- Gives insight into the behavior of real compilers/interpreters.
- Gives a basic foundation for doing further work in compiler or runtime
  system research.
- Get experience doing substantial coding in an interesting domain.

This is a fairly new course, based on a new(ish) book.

- "Essentials of Compilation: An Incremental Approach in Python"
- Jeremy Siek, Indiana University; Scheme/Racket
- Heavily based on program reading/writing.
- An alternative we're not following: using real infrastructure,
  e.g. JVM, LLVM, WebAssembly, etc. (but pointers will be given as we go)

FORMAT:

- Based around the book, basically one chapter per week.
- Weekly homeworks. Will need to cut down the exercises -- less code writing.
  But still quite a bit of code READING.
- Homeworks graded primarily on passing tests.
- Class lectures will cover the book material, with some additions.
  (There may also be a few additional assigned readings.)
- Homeworks due at noon on Tuesdays, starting next week.
- Work in teams; sharing help across teams is also ok.
- Lots of code review in class, both before and after the assignments are due.
- Midterm and final exam won't involve fresh coding, but will cover the
  same material again. (Mainly to make sure that every individual is
  getting the material.) Exams will be in person!

WHY TEAMS?

- Community for shared learning
- Improve quality using pair programming
- Share workload (not perfectly efficient, but still useful)
- Opportunity for socialization
- Allows me to give better feedback

ADMIN:

- Course web page
- Slack channel
- Also lecture notes (rarely), recordings, tablet captures.
- Homework out and in via GitHub Classroom.

--------------------------------------------------------------------------

OVERALL VIEW:

Task: Making a high-level language run on a low-level machine.

High-level: expressions, structured control, data types,
            functions w/local data, managed memory, ...

Low-level:  primitive instructions, flat memory + limited registers,
            labels + conditional jumps, ...

-----------

Interpreters vs. Compilers

- Lower the source or raise the target?

--------------

Nature of task:

- Not so hard to get working code (but lots of details!)
- Much harder to get high-quality code
  => Much of the work is in optimization (though not in this course)
- Languages are getting more sophisticated/abstract.
- Hardware is getting more complex/unpredictable/heterogeneous.
- Engineering; not much science (or math)
- Correctness is a big concern (but verification is still very rare)

----------------------

Generic Compiler Architecture

  - Source code
        |
        V
  o Lexical Analysis & Parsing
        |
        V
  - Abstract Syntax Tree (AST)
        |
        V
  o Type-checking and other Static Correctness Analysis
        |
        V
  - (Revised) AST (for legal program)
        |
        V
  o Intermediate Code Generation
        |
        V
  - Intermediate Representation (IR)  -->  o Interpreter
        |
        V
  o Machine-independent Optimization
        |
        V
  - Revised IR
        |
        V
  o Target Code Generation
        |
        V
  - Machine Code
        |
        V
  o Machine-dependent optimization
        |
        V
  - Improved Machine Code
        |
        V
  o Code Emission
        |
        V
  - Binaries (files or core images)
        |
        V
  o Linker/Loader
        |
        V
  - Core image that can be executed
    (Combined with runtime library: interface to O/S, memory management,
     thread support, etc.)

CAVEATS:

- Often mix interp/compilation models, e.g. Java/JVM.
- JIT compilation

------------------------------------

NOVELTY OF BOOK'S APPROACH:

- Slice vertically into features, rather than horizontally into passes.
- Also: use lots of tiny passes ("nano-passes") between specialized IRs
- Equip each IR with interpreter and checkers: systematic approach to
  checking and testing after each phase.

------------------------------------

Python as implementation language:

- Well-known language, chosen for its familiarity.
  (As opposed to Racket, OCaml.)
- Very easy to get started with; huge amount of tutorial material on the web.
- No strong type checking: makes development harder.
  * We will attempt to use type annotations and gradual typing tools
    to improve this somewhat.
- A very small subset will ALSO serve as the source language for our compiler.
  No very strong reason for this.
- We will be using some relatively recent features, notably match statements.
  * And taking a somewhat "functional" approach.
  * But also using some essential OO features.
- Would like to grow the Python resources section of the website.