CS164 Programming Languages and Compilers Fall 2024
Programming Assignment 3
Assigned: Nov 4, 2024 Checkpoint: Nov 24, 2024 at 11:59pm Due: Dec 6, 2024 at 11:59pm
1 Overview
The four programming assignments in this course will direct you to develop a compiler for ChocoPy, a statically typed dialect of Python. The assignments will cover (1) lexing and parsing of ChocoPy into an abstract syntax tree (AST), (2) semantic analysis of the AST, (3) code generation, and (4) optimization.
For this assignment, you are to implement a RISC-V code generator for ChocoPy. This phase of the compiler takes as input the type-annotated AST of a semantically valid and well-typed ChocoPy program, and produces as output RISC-V assembly code. Section 6 describes the version of RISC-V that we will be using, as well as the execution environment used for grading.
This assignment is also accompanied by the ChocoPy RISC-V implementation guide, which is a document that describes in detail the design decisions taken by the reference compiler. Unlike previous assignments, the starter code provided for this assignment is quite extensive. We encourage you to make full use of this code, since it will save you about half the development effort of building a code generator. Reading the accompanying implementation guide is essential to understanding the provided starter code. This assignment can get a bit tedious, so start early. However, implementing a code generator can be a very rewarding task, since you will (finally) be able to execute ChocoPy programs and observe their behavior.
2 Getting started
We are going to use the Github Classroom platform for managing programming assignments and submissions.
• Visit https://classroom.github.com/a/0GbSdUtZ for the assignment. You will need a GitHub account to join.
• The first team member accepting the assignment should create a new team with some rea- sonable team name. The second team member can then find the team in the list of open teams and join it when accepting the assignment. A private GitHub repository will be created for your team. It should be of the form https://github.com/cs164fa2024/ pa3-chocopy-code-generation-
• Ensure you have Git, Apache Maven, JDK 21+ and Python 3.11+ installed. See Section 3 for more information regarding software.
• If your team name is
It will contain all the files required for the assignment. Your repository must remain private; otherwise, you will get 0 points in this assignment.
• Add the upstream repository in order to receive future updates to this repository. This must be done only once per local clone of your repository. Run
git remote add upstream \
https://github.com/cs164fa2024/pa3-chocopy-code-generation.git
• Run mvn clean package. This will compile the starter code, which analyzes all declarations in a ChocoPy and emits everything that is needed in the data segment, as well as a skeleton text segment for the top-level statements. Your goal is to emit code for top-level statements as well as for every function/method defined in the ChocoPy program.
• Run the following command to test your analysis against sample inputs and expected outputs— only one test will pass with the starter code:
java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy –pass=..s \
–run –dir src/test/data/pa3/sample –test
Windows users should replace the colon between the JAR names in the classpath with a semicolon: java -cp “chocopy-ref.jar;target/assignment.jar” …. This applies to all java commands listed in this document.
3 Software dependencies
The software required for this assignment is as follows:
• Git, version 2.5 or newer: https://git-scm.com/downloads
• Java Development Kit (JDK), version 21 or newer: http://www.oracle.com/technetwork/ java/javase/downloads/index.html
• Apache Maven, version 3.9.9 or newer: https://maven.apache.org/download.cgi
• (optional) An IDE such as IntelliJ IDEA (free community editor or ultimate edition for students):
https://www.jetbrains.com/idea.
• Python, version 3.11 or newer, for running benchmarking scripts and ChocoPy programs in a
Python interpreter: https://www.python.org/downloads
If you are using Linux or MacOS, we recommend using a package manager such as apt or homebrew. Otherwise, you can simply download and install the software from the websites listed above. We also recommend using an IDE to develop and debug your code. In IntelliJ, you should be able to import the repository as a Maven project.
4 External Documentation
• Chocopy Language Reference: https://sites.google.com/berkeley.edu/cs164fa24/ chocopy
• Chocopy Implementation Guide https://drive.google.com/file/d/ 1jncSRMncXHgDApRCQGV3kyUa3X6TlsLe/view
• RISC-V specification: https://riscv.org/specifications
• Venus wiki: https://github.com/kvakil/venus/wiki. We are using a modified version of
Venus for this course. Section 6 describes our simulator and its differences from the original.
• CS 61C RISC-V Reference Card: https://cs61c.org/fa24/pdfs/resources/
reference-card.pdf
• Godbolt Compiler Explorer: https://godbolt.org/z/GfxG6dG8E. This is useful for seeing
what RISC-V assembly C compilers generate.
5 Files and directories
The assignment repository contains a number of files that provide a skeleton for the project. Some of these files should not be modified, as they are essential for the assignment to compile correctly. Other files must be modified in order to complete the assignment. You may also have to create some new files in this directory structure. The list below summarizes each file or directory in the provided skeleton.
• pom.xml: The Apache Maven build configuration. You do not need to modify this as it is set up to compile the entire pipeline. We will overwrite this file with the original pom.xml while autograding.
• src/: The src directory contains manually editable source files, some of which you must modify for this assignment. Classes in the chocopy.common package may not be modified, because they are common to your assignment and the reference implementation / test framework. However, you are free to duplicate/extend these classes in the chocopy.pa3 package or elsewhere. Section 8 describes in detail how the provided starter code is meant to be extended without requiring any duplication.
– src/main/java/chocopy/pa3/StudentCodeGen.java: This class is the entry point to the code generation phase of your compiler. It contains a single method: public static String process(Program program, boolean debug). The first argument to this method will be the typed AST produced by the semantic analyis stage, and the return value should be the RISC-V assembly program. The second argument to this method is true if the –debug flag is provided on the command line when invoking the compiler.
– src/main/java/chocopy/pa3/CodeGenBase.java: This abstract class provides a lot of infrastructure for setting up data structures and definitions for performing code generation. You may need to make minor modifications to this class, but you should not need to make substantial edits, as it is meant to be easily extensible via subclassing. In addition, reading some of the code in this class may be helpful. Section 8.1 describes this class in detail.
– src/main/java/chocopy/pa3/CodeGenImpl.java: This class contains a skeleton imple- mentation of the abstract class chocopy.common.CodeGenBase. You will have to modify this file to emit assembly code for top-level statements and function bodies. Section 8 describes several support classes in detail.
– src/main/java/chocopy/pa3/RiscV.java: This class contains definitions for an in- memory, structured representation of RISC-V assembly. Unless you are looking for a challenge (refer to Section 8.5), you do not need to interact with this file.
– src/main/java/chocopy/pa3/RiscVInstrFactory.java: This class contains convenience methods for constructing in-memory, structured RISC-V assembly instructions. Unless you are looking for a challenge (refer to Section 8.5), you do not need to interact with this file.
– src/main/java/chocopy/common/astnodes/*.java: This package contains one class for every AST-node kind that appears in the input JSON. These are the same classes that were provided in previous assignments.
– src/main/java/chocopy/common/analysis/SymbolTable.java: This class contains a sample implementation of a symbol table, which is a essentially a map from strings to values of a generic type T. This is the same class that was provided in the previous assign- ment.
– src/main/java/chocopy/common/analysis/types/*.java: This package contains a hier- archy of classes for representing types in the typed AST. These are the same classes that were provided in the previous assignment.
– src/main/java/chocopy/common/codegen/*.java: These classes contain all the support classes for the extensive starter code provided to you. Section 8 describes these classes in detail, including how you can extend some of them.
– src/main/asm/chocopy/common/*.s: These files contain assembly-language implementa- tions of built-in functions, which CodeGenBase copies into the output program. You can use the same technique for adding additional runtime support routines (for things such as string concatenation). Just put such routines in a directory src/main/asm/chocopy/pa3 and look to see how CodeGenBase uses the emitStdFunc routines.
– src/test/data/pa3: This directory contains ChocoPy programs for testing your code generator.
∗ /sample/*.py – Sample test programs covering a variety of semantics that you will need to implement in this assignment. Each sample program is designed to test a small number of language features.
∗ /sample/*.py.out.typed – Typed ASTs corresponding to the test programs. These will be the inputs to your code generator.
∗ /sample/*.py.out.typed.s.result – The results of executing the test programs. The assembly programs generated by your compiler should produce exactly these results when executed in order for the corresponding tests to pass.
∗ /benchmarks/*.py – Non-trivial benchmark programs, meant to test the overall work- ing of your compiler. The testing for these programs will be done in the same manner as done for the tests in the sample directory, but these tests will have higher weight during grading.
∗ /benchmarks/*.py.out.typed – Typed ASTs corresponding to the benchmark test programs. These will be the inputs to your code generator.
∗ /benchmarks/*.py.out.typed.s.result – The results of executing the benchmark programs.
Computer Science Tutoring
• target/: The target directory will be created and populated after running mvn clean package. It contains automatically generated files that you should not modify by hand. This directory will be deleted before your submission.
• chocopy-ref.jar: A reference implementation of the ChocoPy compiler, provided by the in- structors.
• README.md: You will have to modify this file with a writeup.
• checkpoint tests.txt: List of tests used for grading at the checkpoint (ref. Section 9). This
list is same as Appendix A of this document.
6 Execution Environment
The target architecture for this code generation assignment is RV32IM, which is the 32-bit version of RISC-V that supports basic integer arithmetic plus the multiplication (and division) extensions. In order to execute RISC-V code in a platform-independent manner, we will be using a version of the Venus simulator, which was originally developed by Keyhan Vakil. Venus dictates the execution environment, which includes the initial values of registers, the addresses of the various memory segments, and the set of supported system calls. Section 4 points to some documentation
for Venus.
6.1 Venus 164
To support the goals of this project, our version of Venus has been modified—we refer to this variant as Venus 164. The modifications mainly try to make the assembly language conform to the one supported by the official GNU-based RISC-V toolchain.
• .word directive: We have added support for emitting addresses in the data segment using the syntax .word
The simulator is distributed both as a JAR in our instructional maven repo (for use with the auto-grader) and in web form to enable interactive debugging. The web version of Venus 164 is hosted at the following URL:
https://chocopy.org/venus.html
You can enter RISC-V assembly code in the editor and then switch to the simulator tab to run the program. You can add break-points and step through instructions one at a time to observe changes to the registers and to the memory. The CS164 staff cannot provide support on using the web UI.
7 Assignment goals
The objective of this assignment is to build a code generator for ChocoPy, which takes as input a typed AST corresponding to a ChocoPy program in JSON format, and produces as output a RISC-V assembly program that can execute in the Venus 164 execution environment.
7.1 Running the compiler
7.1.1 Four-step process
The process of executing a ChocoPy program consists of four basic steps:
1. java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=r
2. java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=.r
3. java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=..s
4. java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–run
where
7.1.2 Reference implementation
To observe the assembly program produced by the reference implementation, replace step 3 above with the following command:
java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=..r
Students are not allowed to copy-paste reference assembly into their implementation, or use ref- erence assembly verbatim. The autograders will check to make sure that students abide by this policy. Thus, the assembly program that your compiler generates need not match (and indeed should not match) the program generated by the reference compiler. See Section 7.2 onward for what is expected from your compiler.
7.1.3 Shortcuts: chained commands
To simplify development, you can also club the above commands into a single command that pipes the output of the each phase to the input of the next phase. The combined command to produce an assembly file from an input ChocoPy program (which is equivalent to running steps 1–3) is as follows:
java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=rrs
You can also add –run at the end of this chain to actually execute an input ChocoPy program in one step.
java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=rrs –run
Finally, the command
java -cp “chocopy-ref.jar:target/assignment.jar” chocopy.ChocoPy \
–pass=rrr –run
will run the combined command with the reference implementation.
In any command, you can omit the –out
standard output instead of a file.
7.2 Input/output specification
The interface to your code generation assignment will be the static method StudentCodeGen.process(). The input to this method is a typed AST in JSON format, corresponding to a semantically valid and well-typed ChocoPy program. The typed AST will be in the same format that was used as the output format for the previous assignment. The field inferredType will be non-null for every expression in the AST that evaluates to a value. The output is expected to be a RISC-V assembly program, which is executed in the Venus 164 environment. The assembly program that your compiler generates should not match the program generated by the reference compiler, as students are not allowed to use reference assembly verbatim. The reference compiler performs several optimizations, that you are not expected to match. Your goal is to independently produce an assembly program which implements the operational semantics given in the ChocoPy reference manual.
Code Help
7.3 Validation
Testing is performed by executing the generated RISC-V program in Venus 164 and comparing the contents of the output stream with that produced by the reference-implementation-generated program. The program is expected to behave as per the operational semantics defined in the ChocoPy language manual: chocopy language reference.pdf. The output should contain a sequence of lines, where the ith line corresponds to the string representation of the str, int, or bool object provided as argument to the ith dynamic invocation of the predefined print function.
7.4 Memory Management
In this assignment, all compiled ChocoPy programs will have 32MB of memory to work with. The register gp will point to the beginning of the heap before the first top-level statement is executed. Garbage collection (GC) has not yet been implemented in the reference implementation; therefore, newly allocated objects block space for the entire remaining duration of the program. You are not expected to implement GC in this assignment, though the heap and object layouts have been designed in such a way that GC can be easily integrated. The tests used by the auto-grader require far less than 32MB of memory to execute.
7.5 Error handling
In case of run-time errors, your program is expected to print an appropriate error message and exit with an appropriate exit code. The error messages and exit codes used by the reference implementation are described in chocopy implementation guide.pdf. Fortunately, you do not have to hand-code the error messages or corresponding exit codes. The errors corresponding to invalid arguments to predefined functions and out-of-memory are generated by the code that has already been provided to you. For errors corresponding to operations on None, division by zero, and index out-of-bounds, we have provided you built-in routines that are emitted in the method CodeGenImpl.emitCustomCode(). Your generated programs can simply jump to one of these labels when the appropriate condition is met and the error message will be printed for you before aborting the program with an appropriate exit code. You do need to jump to these error handlers exactly when the appropriate condition is met. A run-time error is raised when one of the pre-conditions in the operational semantics fails to be true. For example, in the operational rule [dispatch], if the object on which a method is dispatched turns out to be the value None, then the second line fails to be true; therefore, the run-time error is reported after evaluating the object expression but before evaluating any of the arguments of the method call. These rules have been designed to conform to the error-reporting logic used by Python.
You will not be tested on program executions that lead to arithmetic integer overflow or out- of-memory.
7.6 README
Before submitting your completed assignment, you must edit the README.md and provide the following information: (1) names of the team members who completed the assignment, (2) acknowl- edgements for any collaboration or outside help received, and (3) how many late hours have been consumed (refer to the course website for grading policy).
8 Implementation Notes
In this assignment, you are provided a significant amount of skeleton code. You are not strictly required to use this code; however, we strongly recommend that you do, since it performs about half of the work required to implement a code generator for ChocoPy. This section describes the design of the skeleton code. The code itself is also heavily documented using Javadoc-style comments.
Although the entry point for this assignment is the static StudentCodeGen.process() method, this method does little more than handle input/output. Most of the heavy lifting is done within the CodeGenImpl class in the chocopy.pa3 package, which itself is a sub-class of CodeGenBase. The CodeGenImpl class contains skeletons for emitting RISC-V code corresponding to top level statements and function bodies. You are expected to edit this skeleton and emit code corresponding to all types of program statements and expressions. In doing so, you will most likely want to use inherited fields and methods from the base class, CodeGenBase.
8.1 Code generation base
The following tasks have already been performed by CodeGenBase:
1. Analysis of the entire program to create descriptors for classes, functions/methods, variables, and attributes. These descriptors, whose class names end with Info, are placed in appropriate symbol tables. The symbol tables and Info objects are described in Section 8.2. The globalSymbols field in CodeGenBase references the global symbol table. Every FuncInfo object references its corresponding function’s symbol table, which takes into account local definitions, implicitly inherited names, as well as explicit nonlocal/global declarations. You likely do not need to modify the symbol tables in this assignment.
2. Code generation for prototypes of every class (refer to chocopy implementation guide.pdf to understand what prototype means). The ClassInfo objects contain labels pointing to their corresponding prototypes in memory.
3. Code generation for method dispatch tables for every class. The ClassInfo objects contain labels pointing to their corresponding dispatch tables in memory.
4. Code generation for global variables. For every global variable in the program, there exists exactly one GlobalVarInfo object in the global symbol table (these may be inherited by a function’s symbol table). A GlobalVarInfo object contains a label pointing to the global variable allocated in memory. Global variables are emitted in the data segment using their initially defined values from the ChocoPy program.
5. Management of and code generation for constants. The constants field in CodeGenBase refer- ences a manager for constant integers, booleans, and strings encountered in the program. The method constants.getIntConstant(int x) method returns a label that points to a globally- allocated ChocoPy int object having the same value as the Java integer x. Similar methods are available for booleans and strings. The constants’ manager performs caching, so that ev- ery distinct constant label references a unique constant. Once code is emitted for all program statements, the CodeGenBase emits all encountered constants to the global data segment.
6. Code generation for predefined functions and built-in routines. The CodeGenBase class emits bodies of predefined functions such as len, print, input, and object. init , as well as built- in routines such as abort and alloc. Although you do not need to modify this logic, you
may want to read through the code that emits these functions/routines in order to get some inspiration for how to emit code in your own CodeGenImpl for user-defined functions.
7. Initialization of the heap and clean exit. The CodeGenBase class emits some start-up code that should execute before the first top-level statement is executed. The start-up code includes logic for initializing the heap and setting the initial value of fp. The CodeGenBase class also emits some tear-down code that should execute after the last top-level statement has been executed. The tear-down code performs a successful exit from the execution environment. The code that you will emit in the method CodeGenImpl.emitTopLevel() will be placed in-between the start- up and tear-down logic.
To summarize, the CodeGenBase takes care of populating symbol tables, emitting everything that needs to be emitted to the global data segment, as well as emitting boilerplate code to the text segment. Your task in this assignment is to leverage the symbol tables and other available utilities to emit code in the text segment by filling in the CodeGenImpl.
8.2 Symbol table
A symbol table maps identifiers to their corresponding symbol descriptors. This mapping changes depending on the current scope. The starter code creates the following types of symbol descriptors in its analysis (you likely do not need to add to this hierarchy):
• FuncInfo: A descriptor for functions and methods. A function has an associated depth: global functions and methods have a depth of 0, whereas nested functions that are defined within a function of depth d have a depth of d + 1. A FuncInfo object contains the function’s depth, its symbol table, its parameter list (a list of names), its local variables (a list of StackVarInfo objects), a label corresponding to its entry point, and a reference to the FuncInfo of its enclosing function (if applicable). The FuncInfo class also contains a utility method, getVarIndex(), to retrieve the index of a parameter or local variable in the function’s activation record.
• ClassInfo: A descriptor for classes. A ClassInfo object corresponding to a class contains its type tag, its attributes (a list of AttrInfo objects), its methods (a list of FuncInfo objects), a label corresponding to its prototype and a label corresponding to its dispatch table. This class also contains utility methods to get the index of an attribute in the object layout or the index of a method in the dispatch table.
• GlobalVarInfo: A descriptor for a global variable. A GlobalVarInfo object simply contains the label of its corresponding global variable.
• AttrInfo: A descriptor for class attributes. An AttrInfo object contains the initial value of its corresponding attribute, represented as a label that points to a constant allocated in the data segment; the label may be null in case of an initial value of None.
• StackVarInfo: A descriptor for variables allocated on the stack, such as parameters and local variables. A StackVarInfo object contains the initial value of its corresponding variable, rep- resented as a label that points to a constant allocated in the data segment; the label may be null in case of an initial value of None. A StackVarInfo object also references the FuncInfo object corresponding to the function which defines the stack variable; this pointer is useful for determining the static depth of a stack-allocated variable, which may be necessary when emitting code for accessing non-local variables.
浙大学霸代写 加微信 cstutorcs
8.3 Labels
The class Label is heavily used throughout the provided code framework to represent labels in the generated assembly. A Label object simply encapsulates the name of a label as a string. Several instruction-emitting methods of the RiscVAsmWriter expect a Label as an argument.
Labels can be created in two ways: either by directly instantiating a new Label object with a specific string provided as an argument to its constructor, or by invoking the utility method generateLocalLabel() defined in CodeGenBase. The utility method generates a fresh label named label
You should only jump to global labels using unconditional jumps such as jr or jal. If you want to conditionally branch to a global label (e.g