CS186 Project 3 Joins and Query Optimization

Part 0: Skeleton Code

To read, or not to read, that is the question
In this project you’ll be implementing some common join algorithms and a limited version of the Selinger optimizer. We’ve provided a brief introduction into the new parts of the code base you’ll be working with.
For Part 1 we recommend you read through:
common/iterator – Details on backtracking iterators, which will be needed to implement joins
Join Operators – Details on the base class of the join operators you’ll be implementing and some useful helper methods we’ve provided
query/disk – Details on some useful classes for implementing Grace Hash Join and External Sort
For Part 2 we recommend you read through:
Scan and Special Operators – These talk about additional operators that you’ll use while creating query plans
query/QueryPlan.java – Gives a high level overview of a QueryPlan and some details on how to create and work with them
common/iterator
The common/iterator directory contains an interface called a BacktrackingIterator. Iterators that implement this will be able to mark a point during iteration, and reset back to that mark. For example, here we have a backtracking iterator that just returns 1, 2, and 3, but can backtrack:
BackTrackingIterator iter = new BackTrackingIteratorImplementation();
iter.next(); // returns 1
iter.next(); // returns 2
iter.markPrev(); // marks the previously returned value, 2
iter.next(); // returns 3
iter.hasNext(); // returns false
iter.reset(); // reset to the marked value (line 3)
iter.hasNext(); // returns true
iter.next(); // returns 2
iter.markNext(); // mark the value to be returned next, 3
iter.next(); // returns 3
iter.hasNext(); // returns false
iter.reset(); // reset to the marked value (line 11)
iter.hasNext(); // returns true
iter.next(); // returns 3
ArrayBacktrackingIterator implements this interface. It takes in an array and returns a backtracking iterator over the values in that array.
query/QueryOperator.java
The query directory contains what are called query operators. A single query to the database may be expressed as a composition of these operators. All operators extend the QueryOperator class and implement the Iterable interface. The scan operators fetch data from a single table. The remaining operators take one or more input operators, transform or combine the input (e.g. projecting away columns, sorting, joining), and return a collection of records.
Join Operators
JoinOperator.java is the base class of all the join operators. Reading this file and understanding the methods given to you can save you a lot of time on Part 1. It provides methods you may need to deal with tables and the current transaction. You should not be dealing directly with Table objects nor TransactionContext objects while implementing join algorithms in Part 1 (aside from passing them into methods that require them). Subclasses of JoinOperator are all located in query/join.
Some helper methods you might want to be aware of are located here.
Scan Operators
The scan operators fetch data directly from a table.
SequentialScanOperator.java – Takes a table name provides an iterator over all the records of that table
IndexScanOperator.java – Takes a table name, column name, a PredicateOperator (>, <, <=, >=, =) and a value. The column specified must have an index built on it for this operator to work. If so, the index scan will use take advantage of the index to yield records with columns satisfying the given predicate and value (e.g. salaries.yearid >= 2000) efficiently
Special Operators
The remaining operators don’t fall into a specific category, but rather perform some specific purpose.
SelectOperator.java – Corresponds to the σ operator of relational algebra. This operator takes a column name, a PredicateOperator (>, <, <=, >=, =, !=) and a value. It will only yields records from the source operator for which the predicate is satisfied, for example (yearid >= 2000)ProjectOperator.java – Corresponds to the π operator of relational algebra. This operator takes a list of column names and filters out any columns that weren’t listed. Can also compute aggregates, but that is out of scope for this project
SortOperator.java – Yields records from the source operator in sorted order. You’ll be implementing this in Part 1
Other Operators
These operators are out of scope and directly relevant to the code you’ll be writing in this project.
MaterializeOperator.java – Materializes the source operator into a temporary table immediately, and then acts as a sequential scan over the temporary table. Mainly used in testing to control when IOs take place
GroupByOperator.java – Out of scope for this project. This operator accepts a column name and yields the records of the source operator but with the records grouped by their value and each separated by a marker record. For example, if the source operator had singleton records [0,1,2,1,2,0,1] the group by operator might yield [0,0,M,1,1,1,M,2,2] where M is a marker record.
query/disk
The classes in this directory are useful for implementing Grace Hash Join and External Sort, and correpond to the concept of “partitions” and “runs” used in those topics respectively. Both classes have an add method that can be used to insert a record into the partition/run. These classes will automatically buffer insertions and reads so that at most one page is needed in memory at a time.
query/aggr
The classes and functions in this directory implement aggregate functions, and are not necessary to complete the project (though you’re free to browse through them if you’re interested).
query/QueryPlan.java

This is the volcano model, where the operators are layered atop one another, and each operator requests tuples from the input operator(s) as it needs to generate its next output tuple. Note that each operator only fetches tuples from its input operator(s) as needed, rather than all at once!
A query plan is a composition of query operators, and it describes how a query is executed. Recall that SQL is a declarative language – the user does not specify how a query is run, and only what the query should return. Therefore, there are often many possible query plans for a given query.
The QueryPlan class represents a query. Users of the database create queries using the public methods (such as join(), select(), etc.) and then call execute to generate a query plan for the query and get back an iterator over the resulting data set (which is not fully materialized: the iterator generates each tuple as requested). The current implementation of execute simply calls executeNaive, which joins tables in the order given; your task in Part 2 will be to generate better query plans.
SelectPredicate
SelectPredicate is a helper class inside of QueryPlan.java that stores information about that selection predicates that the user has applied, for example someTable.col1 < 186. A select predicate has four values that you can access: tableNameand columnName specify which column the predicate applies to operator represents the type of operator being used (for example <, <=, >, etc…)
value is a DataBox containing a constant value that the column should be evaluated against (in the above example, 186 would be the value).
All of the select predicates for the query are stored inside the selectPredicates instance variable.
JoinPredicate
JoinPredicate is a helper class inside of QueryPlan.java that stores information about the conditions on which tables are joined together, for example: leftTable.leftColumn = rightTable.rightColumn. All joins in RookieDB are equijoins. JoinPredicates have five values:
joinTable: the name of one of the table’s being joined in. Only used for toString()
leftTable: the name of the table on the left side of the equality
leftColumn: the name of the column on the left side of the equality
rightTable: the name of the table on the right side of the equality
rightColumn: The name of the column on the right side of the equality
All of the join predicates for the query are stored inside of the joinPredicates instance variable.
Interface for querying
You should read through the Database.java section of the main overview and browse through examples in src/test/java/edu/berkeley/cs186/database/TestDatabase.java to familiarize yourself with how queries are written in our database.
After execute() has been called on a QueryPlan object, you can print the final query plan:
Iterator result = query.execute();
QueryOperator finalOperator = query.getFinalOperator();
System.out.println(finalOperator.toString());
-> SNLJ on S.sid=E.sid (cost=6)
-> Seq Scan on S (cost=3)
-> Seq Scan on E (cost=3)

Part 1: Join Algorithms

In this part, you will implement some join algorithms: block nested loop join, sort merge, and grace hash join. You can complete Task 1, Task 2 and Task 3 in any order you want. Task 4 is dependent on the completion of Task 3.
Aside from when the comments tell you that you can do something in memory, everything else should be streamed. You should not hold more pages in memory at once than the given algorithm says you are allowed to. Doing otherwise may result in no credit.
Note on terminology: in lecture, we sometimes use both block and page describe the unit of transfer between memory and disk. In the context of join algorithms, however, page refers to the unit of transfer between memory and disk, and block refers to a set of one or more pages. All uses of the word block in this part refer to this second definition (a set of pages).
Convenient assumptions:
For all iterators that will be implemented in this project you can assume hasNext() will always be called before next().
Any Record object provided through an argument or as an element of a list or iterator will never be null.
For testing purposes, we will not be testing behavior on invalid inputs (null objects, negative buffer sizes or buffers too small to perform a join, invalid queries, etc…). You can handle these inputs however you want, or not at all.
Your join operators, sort operator, and query plans do not need to account for underlying relations being mutated during their execution.
Your Tasks
Task 1: Nested Loop Joins
Simple Nested Loop Join (SNLJ)
SNLJ has already been implemented for you in SNLJOperator. You should take a look at it to get a sense for how the pseudocode in lecture and section translate to code, but you should not copy it when writing your own join operators. Although each join algorithm should return the same data, the order differs between each join algorithm, as does the structure of the code. In particular, SNLJ does not need to explicitly manage pages of data (it only ever needs the next record of each table, and therefore can just use an iterator over all records in a table), whereas all the algorithms you will be implementing in this part must explicitly manage when pages of data are fetched from disk.
Page Nested Loop Join (PNLJ)
PNLJ has already been implemented for you as a special case of BNLJ with B=3. Therefore, it will not function properly until BNLJ has been properly implemented. The test cases for both PNLJ and BNLJ in TestNestedLoopJoin depend on a properly implemented BNLJ.
Block Nested Loop Join (BNLJ)
You should read through the given skeleton code in BNLJOperator. The next and hasNext methods of the iterator have already been filled out for you, but you will need to implement the fetchNextRecord method, which should do most of the heavy lifting of the BNLJ algorithm.
There are also two suggested helper methods: fetchNextLeftBlock, which should fetch the next non-empty block of left table pages from leftSourceIterator, and fetchNextRightPage, which should fetch the next non-empty page of the right table (from rightSourceIterator).
The fetchNextRecord method should, as its name suggests, fetches the next record of the join output. When implementing this method there are 4 important cases you should consider:
Case 1: The right page iterator has a value to yield
Case 2: The right page iterator doesn’t have a value to yield but the left block iterator does
Case 3: Neither the right page nor left block iterators have values to yield, but there’s more right pages
Case 4: Neither right page nor left block iterators have values nor are there more right pages, but there are still left blocks
We’ve provided the following animation to give you a feel for how the blocks, pages, and records are traversed during the nested looping process. Identifying where each of these cases take place in the diagram may help guide on what to do in each case.

Animations of SNLJ and PNLJ can be found here. Loaded left records are highlighted in blue, while loaded orange records are highlighted in orange. The dark purple square represents which pair of records are being considered for the join, while light purple shows which pairs have already been considered.
Once you have implemented BNLJOperator, all the tests in TestNestedLoopJoin should pass.
Task 2: Hash Joins
Simple Hash Join (SHJ)
We’ve provided an implementation of Simple Hash Join which can be found in SHJOperator.java. Simple Hash Join performs a single pass of partitioning on only the left records before attempting to join. Read the code for SHJ carefully as it should give you a good idea of how to implement Grace Hash Join.
Grace Hash Join (GHJ)
Everything you will need to implement will be done in GHJOperator.java. You will need to implement the functions partition, buildAndProbe, and run. Additionally, you will have to provide some inputs in getBreakSHJInputs and getBreakGHJInputs which will be used to test that Simple Hash Join fails but Grace Hash Join passes (tested in testBreakSHJButPassGHJ) and that GHJ breaks (tested in testGHJBreak) respectively.
The file Partition.java in the query/disk directory will be useful when working with partitions. Read through the file and get a good idea what methods you can use.
Once you have implemented all the methods in GHJOperator.java, all tests in TestGraceHashJoin.java will pass. There will be no hidden tests for Grace Hash Join. Your grade for Grace Hash Join will come solely from the released public tests.
Task 3: External Sort
The first step in Sort Merge Join is to sort both input relations. Therefore, before you can work on implementing Sort Merge Join, you must first implement an external sorting algorithm.
Recall that a “run” in the context of external mergesort is just a sequence of sorted records. This is represented in SortOperator by the Run class (located in query/disk/Run.java). As runs in external mergesort can span many pages (and eventually span the entirety of the table), the Run class does not keep all its data in memory. Rather, it creates a temporary table and writes all of its data to the temporary table (which is materialized to disk at the buffer manager’s discretion).
You will need to implement the sortRun, mergeSortedRuns, mergePass, and sort methods of SortOperator.
sortRun(run) should sort the passed in data using an in-memory sort (Pass 0 of external mergesort).
mergeSortedRuns(runs) should return a new run given a list of sorted runs.
mergePass(runs) should perform a single merge pass of external mergesort, given a list of all the sorted runs from the previous pass.
sort() should run external mergesort from start to finish, and return the final run with the sorted data
Each of these methods may be tested independently, so you must implement each one as described. You may add additional helper methods as you see fit.
Once you have implemented all four methods, all the tests in TestSortOperator should pass.
Task 4: Sort Merge Join
Now that you have a working external sort, you can now implement Sort Merge Join (SMJ).
For simplicity, your implementation of SMJ should not utilize the optimization discussed in lecture in any case (where the final merge pass of sorting happens at the same time as the join). Therefore, you should use SortOperator to sort during the sort phase of SMJ.
You will need to implement the SortMergeIterator inner class of SortMergeOperator.
Your implementation in SortMergeOperator and your implementation of SortOperator may be tested independently. You must not use any method of SortOperator in SortMergeOperator, aside from the public methods given in the skeleton (in other words: don’t add a new public method to SortOperator and call it from SortMergeOperator).
Once you have implemented SortMergeIterator, all the tests in TestSortMergeJoin should pass.
Submission
Follow the submission instructions here for the Project 3 Part 1 assignment on Gradescope. If you completed everything you should be passing all the tests in the following files:
database.query.TestNestedLoopJoin
database.query.TestGraceHashJoin
database.query.TestSortOperator
database.query.TestSortMergeJoin