4/20/24, 8:34 PM CrustyDB 1: Page Milestone
CrustyDB 1: Page Milestone
Due Date: Friday, April 19th, 2024 at 11:59 am (Noon)
0 Points Possible
Welcome to CrustyDB! CrustyDB is an academic Rust-based relational database management system built by ChiData at The University of Chicago (https://uchi-db.github.io/chidatasite/) , and it is the work of many contributors. It is designed to be an academic project that can be used for teaching and a testbed for research projects. We will use the CrustyDB platform to teach you about database system internals.
The CrustyDB Project
Our approach to building the database system is going to be bottom-up. We start with the storage manager, the entity responsible for storing the data on the disk, and then work our way upwards to the query processing engine.
The project is divided into several milestones, each introducing you to a new concept in database systems. The milestones are as follows:
Crusty DB 1 – Page (This Milestone): You will build a system to persist data onto fixed-sized pages that store variable values. This milestone requires you to build out a slotted-page storage system.
Crusty DB 2 – Heapstore: In the next milestone, you will continue to build the storage engine by implementing a heap file storage manager called the heapstore. If you are taking the graduate version of this course, you will also implement a buffer pool manager, which caches pages in memory and implements a page replacement policy.
Crusty DB 3 – Query Operators: In the third milestone, you will implement a set of query operators that can be used to execute queries on the data stored in the database.
Crusty DB 4 – Choose your own Adventure (Graduate Students Only): If you are taking the graduate version of this course, you will additionally be tasked with implementing an additional feature of CrustyDB of your choosing.
Getting Started
We will use a separate upstream repository to manage the files for CrustyDB. You will need to set up a private repository using Github Classroom, separate from the homework respository you set up in HW0.
Add Comment
https://canvas.uchicago.edu/courses/55320/assignments/659051 1/16
@github-classroom[bot] has invited you to collaborate on the
uchicago-cmsc23500-spr-2024/crustydb-GITHUB_USERNAME repository.
https://github.com/uchicago-cmsc23500-spr-2024/crustydb-GITHUB_USERNAME
GITHUB_USERNAME
cmsc23500-spr-2024/crustydb-jrandom.git
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
Open the following URL in a browser tab: https://classroom.github.com/a/SW3F_tWN (https://classroom.github.com/a/SW3F_tWN)
Then complete the following steps:
You may link your GitHub account to a CNetID if you have not already done so. You may skip this step if you don’t find your CNetID in the list.
You must click “Accept this assignment”, or your repository will not be created. Do not skip this step!
You will be taken to a page saying your repository is being prepared. It will usually be ready within a few seconds, and you can just reload the page to confirm your repository is set up. Please note that this will sometimes take a few minutes.
If you run into any issues, please ask for help on Ed.
You may receive an email from GitHub that looks like the following:
You can go ahead and click on the “view invitation” button to accept the invitation.
Important: Do NOT follow the repository initialization instructions provided by GitHub. Follow the instructions provided in the next section instead.
Now, you will finish setting up the repository you just requested. You will only need to do this once since you will be using the same repository for all the CrustyDB milestones this quarter.
Verify that your repository has been created on GitHub. To do so, open a browser tab to this URL:
where is replaced by your GitHub username. If you are not able to open this URL successfully, please ask for help.
Get the SSH URL of the repository from GitHub. In the browser window that you opened to your repository, under “Quick setup — if you’ve done this kind of thing before”, make sure the “SSH” button is selected. Your repository URL should look something like this:
refer to this URL as later on.
In a new terminal, run the commands listed below.
. (Except will be your GitHub username). We will
https://canvas.uchicago.edu/courses/55320/assignments/659051 2/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
Remember to replace GITHUB_USERNAME with your GitHub username and REPO_URL with the SSH URL that appears on your repository page on GitHub (as described in Step 2)
Here are the commands:
$ mkdir -p cmsc23500-spr-2024/crustydb-GITHUB_USERNAME
$ cd cmsc23500-spr-2024/crustydb-GITHUB_USERNAME
$ git init
$ git remote add origin REPO_URL
$ git remote add upstream
$ git pull upstream main
$ git branch -M main
$ git push -u origin main
Let’s do a few quick checks to make sure everything is properly set up.
You should see a few folders including src , docs and some files in the root of the repository.,
The contents of the README.md file will be the following:
# CrustyDB
This is the repository for the Academic Handout version of the CrustyDB project.
Please see your handout instructions for more information.
## CrustyDB 1 – Page Milestone
Implement the slotted page structure in the files `src/storage/heapstore/page.rs`
and `src/storage/heapstore/heap_page.rs`.
Please see your handout instructions for more information.
Do not modify any other files in the repository.
If not, your repository was not correctly initialized. If you followed GitHub’s instructions for initializing the repository instead of following our instructions, you will need to request a new repository. Please post on Ed to request that we remove your repository so you can request a new one.
Next, the files you see in your repository should also be in your repository on GitHub (to check, use a browser to view
https://github.com/uchicago-cmsc23500-spr-2024/crustydb-GITHUB_USERNAME
where GITHUB_USERNAME is replaced by your GitHub username.) Next, run this:
$ git remote -v
It should print the following (it is ok if the lines don’t appear in this exact order, as long as all four lines appear)::
origin (fetch)
origin (push)
https://canvas.uchicago.edu/courses/55320/assignments/659051 3/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
upstream (fetch)
upstream (push)
(where your GitHub username will appear instead of GITHUB_USERNAME ) Note
If you realize you entered the wrong value for either the note that you will not be able to update it by running error message that says something like:
or remote, please again (you will get an
Instead, you must run git remote set-url . For example, suppose your GitHub username
is SampleStudent and you entered the wrong value for the origin remote. You can update it like this
Finally, if you run:
$ git status .
in your crustydb-GITHUB_USERNAME directory, the result should be:
Source Code Layout
Now that you have obtained a copy of the CrustyDB started code, let’s explore its structure. CrustyDB is set up as a Rust workspace, and various modules/components of the database are broken into separate packages/crates. To build a specific crate (for example, the common crate), you would use the following command . Note if a package/crate depends on another crate (e.g. the heapstore crate depends on crate) those crates will automatically be built as part of the process.
The crates (currently located in src ) are:
common : shared data structures or logical components needed by everything in CrustyDB.
This includes things like tables, errors, logical query plans, ids, some test utilities, etc.
storage : A module that contains multiple storage managers (SM’s). The storage module is broken into several crates:
storage/heapstore : a storage manager for storing data in heap files. Additional crates (which will be added in future milestones) include:
git remote set-url origin
git remote add
error: remote origin already exists
On branch main
Your branch is up to date with ‘origin/main’.
nothing to commit, working tree clean
cargo build -p common
https://canvas.uchicago.edu/courses/55320/assignments/659051 4/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
cli-crusty : a command line interface client binary application that can connect and issue commands/queries to a running CrustyDB server.
optimizer : a crate for generating the query execution plan and for query optimization queryexe : responsible for executing queries. This contains the operator implementations as
well as the execution code.
server : the binary crate for running a CrustyDB server. This will glue all modules (outside a client) together.
: a near-empty crate for an optional milestone to implement transactions. the use of is embedded in many other crates, but can be safely ignored for the given milestones. There is also the use of a logical timestamp throughout many components. You can safely ignore this.
utilities : utilities for performance benchmarks that will be used by an optional milestone Note
For the first milestone, you will only be able to see the files
in heapstore and common crates. The other parts of CrustyDB will be added to the upstream repository from Milestone 2 onwards.
The CrustyDB Storage Manager
In CrustyDB a storage manager (SM) is responsible for persisting all data (aka writing it to disk). The storage manager is a generic interface that allows us to create different storage managers for storage strategies. We will be building an SM that stores data in fixed-sized pages within heap files, called the heapstore. Here’s some more information about storage managers in CrustyDB:
A SM in Crusty is agnostic to what is being stored, as it takes a request to store a value as bytes (a Vec
A container could represent a table, an index, a stored result, or anything else you want to persist.
For example, CrustyDB will create a container for each table/relation stored, and each record will be stored as a value .
Note that there is a 1-1 relationship between containers and heapfiles: you can think of ‘containers’ as wrappers that allow the SM to manage things like heapfile access permissions.
Other components in the CrustyDB system are responsible for converting data into bytes for the SM and interpreting bytes from the SM.
The SM manages access to all other interfaces related to storage tasks, such
as HeapFile or Page , and acts as the sole interface through which different components can
txn_manager
transaction
https://canvas.uchicago.edu/courses/55320/assignments/659051 5/16
Programming Help
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
persist data or interact with data on disk. Heapstore Design
The heapstore is a storage manager that manages data stored in heap files. Any value that is to be stored in the database is first converted to a byte array and then passed into the heapstore to be stored. You’ll learn much more about heap files and storage managers in the next milestone, but in brief:
A “heapfile” is a struct that manages a file. You can think of this as a wrapper around a file that contains additional metadata and methods to help you interact with the file.
The heapfile struct will contain info to help you utilize that file in the context of Crusty, but the file it’s linked to is just a regular file in your filesystem.
A heapfile organizes data as values stored in fixed-sized pages. Each page is a fixed size (defined by PAGE_SIZE in common::lib.rs ), and the heapfile will manage the pages in the file. We will specifically be using Slotted Page architecture to manage the values in a page.
Slotted Page Architecture
A heapfile is made up of a sequence of fixed-sized pages (the size being defined
by PAGE_SIZE in common::lib.rs ) concatenated together into one file. In this milestone, you will focus on one piece of functionality of the heapstore crate, the page. A page is a fixed-sized data structure that holds variable-sized values (in our case, records) via slotted storage. In slotted storage, each record inserted into a page is associated with a slot that points to a contiguous sequence of bytes on the page. A record/value will never be split across pages. The logic for managing values in a page is as follows:
https://canvas.uchicago.edu/courses/55320/assignments/659051 6/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
When a value is stored in a page, it is associated with a slot_id that should not change.
The page should always assign the lowest available slot_id to an insertion. Therefore, if the value associated with a given slot_id is deleted from the page, you should reuse
this slot_id (see more on deletion below).
While the location of the actual bytes of a value in a page can change, the slot_id should not. Note that this means that slot_id s are not tied to a specific location on the page either.
When storing values in a page, the page should insert the value in the ‘first’ available space in the page. We quote first as it depends on your implementation what first actually means.
If a value is deleted, that space should be reused by a later insert.
When free space is reclaimed and compacted together is up to you; however if there is enough free space in the page you should always accept an insertion request – even if the free space was previously used or is not contiguous.
A page should provide an iterator to return all of the valid values and their corresponding slot_id stored in the page.
The bytes that make up a page are broken into:
The header, which holds metadata about the page and the values it stores.
Restrictions on the header’s composition and size are detailed in the next section.
The body, which is where the bytes for values are stored, i.e., the actual records, and comprise the remaining bytes in the page.
Thus the entire page (i.e the header and body) must be packed into a contiguous byte array of size . Note that while values can differ in size, CrustyDB can reject any value that is larger than .
Id and Offset types For this milestone, we will use PageId and SlotId types for each distinct value that is stored in a page. The data types used for these Ids are also defined in common::ids .
A related type definition in page.rs is the Offset type, defined as follows: pub type Offset = u16;
The Offset type can be used to store a location within the page (as an offset) using just 2 bytes. Note that Rust will default most lengths or indexes to a usize which is 8 bytes on most systems (i.e. u64 ).
While it is usually not safe to downcast a usize to a smaller size type (i.e. Offset ), if you are careful as to what you are indexing into or checking the size of, you can downcast ( x as Offset ) – assuming your sizes or indexes do not exceed the Offset size bounds ($2^16$). When casting or defining variables, you should use the CrustyDB specific type for the purpose and not the original type, as these type definitions can change (e.g., always use SlotId when referring to a slot number and not u16 ).
https://canvas.uchicago.edu/courses/55320/assignments/659051 7/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
Page and Header Structure
8 bytes 6 bytes/slot ◄──────────────────────────► ◄──────────────────────► ┌───────────────────────────┬───────────────────────┬──────┐
▲ │ Page Metadata │ Slot 1 Metadata │ │ ▲ │ ├───────┬───────┐ ┌──────┼───────┬──────┐ ┌─────┤ … │ │ │ │PageId │ … │… │ … │ … │ … │..│… │ │ │
Page│ ├───────┴───────┴────┴──────┴──────┬┴──────┴──┴─────┴──────┤ │
… … …
│ Slot n Metadata │ │
├───────┬──────┐ ┌─────┤ │
│ … │ … │..│… │ │
▼ ├──────────────────────────────────┴───────┴──────┴──┴─────┤ │
│ ││ │▲ ││ ││ ││ │ │Free Space ││
││ Slot Offset ││
││ ┌─────────────────────┬────────────────────────────┤ │
││ │Value n │Value n-1 │ │ │ ▼ │ │ ││ ├───────┴─────────────────────┴────────────────────────────┤ │ │ ││ │ … … … … … │ │ │ ││ ├────────────────────────────────┬─────────────────────────┤ │ │Value 1 … │ Value 0 │ │ │││▼ └────────────────────────────────┴─────────────────────────┘
││ PAGE_SIZE ││
The header should be designed carefully. We will need enough metadata to manage the slots efficiently, but we also want to minimize the header size to maximize the space available for storing records. We also have to make some assumptions about the page structure to provide useful tests that you can use to verify your implementation.
As you decide on your metadata implementation, Here are some tips to guide you:
You will need to store the PageId of the page in the header. The storage manager must know which page it is reading/writing on.
You will need to store the number of slots on the page so you know how many slots are in use.
You will need to store the offset of the first free space on the page so you know where to insert the next record.
For each slot, you will need to know the offset of the record in the body of the page. As we support variable-length records, you will also need to store the length of each record in the slot metadata.
Consider how you will handle deleted records. As you complete the Page functionality, you will need to reuse the space of deleted records.
Suggested Steps
https://canvas.uchicago.edu/courses/55320/assignments/659051 8/16
Code Help
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
We will now provide a rough order of suggested steps to complete this milestone. Please note that this is not an exhaustive list of all required tests for the milestone, and you may want to write (and possibly contribute) additional tests
Note that this milestone will have more guidance than later, so you will be completing the required functions for much of this milestone. This milestone includes a series of unit and integration tests to test your page’s functionality. This module has a moderate amount of comments. Not all packages in CrustyDB will have the same level of comments. Working on a moderate-sized code base with limited comments and documentation is typical when working with large systems projects, and this should serve as an excellent introduction to this highly valued skill.
Read page.rs and heap_page.rs
The heap page is the basic building block of this milestone, so start with
the page.rs and heap_page.rs files. page.rs contains the Page struct, which is a generic page struct that can essentially store any byte array that is of size PAGE_SIZE . heap_page.rs contains the HeapPage trait, which defines a specific type of page that implements the slotted page architecture as described above.
Start by reading through the functions and comments to understand the functionality to be implemented. You may find it helpful to look at the unit tests in these files to check our understanding of their expected behavior before you code.
As you read through, think about what data structures/metadata you will need to allow for storing variable-sized values. You may end up adding new helper/utility functions.
Implement the Page struct in page.rs
First, in page.rs , you should implement the methods for the Page struct. All of the data for a page should be stored in the single 4096-byte array within the struct called data . The methods to be implemented for this struct are:
new – Create a new page with the given PageId . This function should initialize the data field of the struct and serialize PageID into the appropriate location of the data byte array.
get_page_id – Get the PageId of the page from the data field.
from_bytes – Create a new page from a byte array. This function should return a Page struct
with the given byte array.
to_bytes – Return a reference to the page as a byte array.
Note the inputs of the and to_bytes functions. They take/return explicit byte arrays that are of size . The heapstore storage manager will always
from_bytes
https://canvas.uchicago.edu/courses/55320/assignments/659051 9/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
provide/expect byte arrays of this size. In this case, do we need explicit byte conversions to serialize/deserialize the page using these methods?
After you complete these functions, you should be able to run and pass
the hs_page_create_basic test. As a reminder, to run a single test, you can use the following command:
cargo test -p heapstore hs_page_create_basic
You may use the same syntax to run any test mentioned in future sections.
You may see some warnings about unused variables or functions. You can ignore these warnings for now, as they are expected until you implement the rest of the milestone.
Implement the HeapPage trait in heap_page.rs
Once your basic page handling code is working, we can move on to the more advanced functions
listed in heap_page.rs .
The HeapPage trait is a trait that defines the basic functionality of a page in a heapfile. Your page should continue implementing the slotted page architecture described in the Slotted Page Architecture section.
Header Utility Functions
At this point, you should work on your header design. You should add any helper/utility functions that you think will be useful for managing the header information. You should think about how you will access and update the header information as well as information per slot. We don’t provide any explicit designs or tests for this, so you will need to think about what you need to implement to manage the header information effectively.
To add a helper function, you should add a function signature to the HeapPage trait and implement it in the implementation block for the Page struct. Do not modify any of the existing function signatures in the HeapPage trait, this will cause the tests to fail.
Once your header and slot design and associated helper functions are done, a natural starting point is to implement two utility functions as follows:
get_header_size for getting the current header size when serialized (which will be useful for figuring out how much free space you really have) and
get_free_space to determine the largest block of data free in the page. Once these are implemented, you should be able to pass
the hs_page_sizes_header_free_space test. https://canvas.uchicago.edu/courses/55320/assignments/659051
程序代写 CS代考 加QQ: 749389476
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
Inserting and Retrieving Values
With the header working, move on to add_value . This should enable hs_page_simple_insert to pass. This test adds some tuples (as bytes) to the page and then checks that (1) the slot ids are assigned in order and (2) that the largest free space and header size are aligned.
Once that tests work.
Deleting Values
is correctly implemented, you can move on to get_value and verify passes. At this point,
, hs_page_header_size_full , and hs_page_no_space should also
hs_page_get_value
hs_page_header_size_small
Next, implement the function delete_value , which should free up the bytes previously used by the slot_id and make the corresponding slot_id available for the next value that may be inserted. Start with the test hs_page_simple_delete , which only verifies that deleted values are gone. Once this is working, you will want to ensure you are reusing the space/slots. Please refer to the sections below, which explain space and SlotId reclamation as well as the expected compaction logic for a page. I suggest writing a utility function to find the first free space on a page. Here, you might want to explore inserting byte vectors of different sizes and see if you can replace/reuse the space as effectively as possible. You should have hs_page_delete_insert working also at this point.
Space Reclamation Example
The page should use deleted space again, but there is no requirement as to when the page should be reclaimed. In other words, you should never decline an add_value request when there is enough free space on the page, even if it is scattered in multiple blocks.
To visualize the free space reclamation, imagine the following scenario: We have a value AA, a value B, a value CC, and three free spaces (-). The SlotIds of AA, B, and CC are 0,1,2 respectively. The physical layout of the page is as follows:
After we delete B, the page looks like this:
Now, when inserting item D, we could use the free space ( – ) between A & C (resulting in
– ) or use free space – after CCC (resulting AA-CCD– ). Let’s go with the latter option. Either way, the slotId of D should be 1 (as we should re-use B’s SlotId ). Now the page looks like this:
Now, if we want to insert EE , we only have one viable spot/space. The slotId of EE should be 3 , and the page should look like this:
https://canvas.uchicago.edu/courses/55320/assignments/659051 11/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
Inserting FF should be rejected (i.e. return None) as it’s too large. No slotId should be assigned. Inserting G must be accepted as there is room. The slotId of G should be 4 .
Compaction
If we delete G and EE , we again have the following three spaces free. AA-CCD–
If the page attempts to insert NNNN , this request should not work as there is not enough space to hold NNNN (i.e. we should return None ). Conversely, an insert for HHH should work, as we have three spaces available.
However, since they are not contiguous spaces the page will need to compact the existing values such that three contigious spaces exist on the page. One of the possible layouts after compaction is as follows:
Now, the insertion of HHH should result in the following layout: AACCDHHH
Note that since slot 3 is the lowest available slot ID, HHH should be assigned a slot ID of 3. Note that when and how you compact data is up to you, but you must compact values if there is expected free space (accounting for necessary header data).
Page Iterator
The last component of this milestone is writing an iterator to ‘walk’ through all valid values stored on a page. This consuming iterator will move/take ownership of the page. The [Rust Documentation on Iterators (https://doc.rust-lang.org/book/ch13-02-iterators.html) will help you understand the iterator pattern and the traits involved in creating and managing iterators in Rust. To complete this task, you will have to:
Fill in the struct HeapPageIntoIter to hold the metadata for the iterator,
the next function in the impl Iterator for HeapPageIntoIter trait implementation
and into_iter function in impl IntoIterator for Page trait implementation that creates the iterator from a page.
With these functions, should pass. The tests will assume that values are returned by the iterator in ascending order.
Final Tests
hs_page_iter
https://canvas.uchicago.edu/courses/55320/assignments/659051 12/16
4/20/24, 8:34 PM CrustyDB 1: Page Milestone
After completing the iterator, all required functionality for this milestone should be complete, and you can run all the tests in the file by running the following command:
cargo test -p heapstore hs_page_
One of these tests, hs_page_stress_test is designed to test your implementation with a large number of randomized insertions and deletions. If you pass this test consistently, you can be fairly confident that your implementation is correct.
If not, use the Logging and Debugging tips below to debug your implementation. In the past, students have successfully fixed their implementations by replicating an exact stress test scenario by logging the insertions/deletions and verifying that the page state is as expected at each step.
Performance Considerations
This project is designed to be cumulative, and a successful implementation of this milestone is crucial for the next milestones. If you find that your implementation is slow (e.g., the stress test is taking a long time to run), you should evaluate your implementation for performance bottlenecks. Some suggestions for optimizing your implementation are:
Memory Allocation: You should avoid unnecessary memory allocations. For example, you should not make unnecessary copies of byte arrays.
Inline Functions: You should consider inlining your helper functions that are frequently called, mainly functions used to manage metadata or access individual Slots or Values in the Page. You can read more about the inline macro in the Rust Reference (https://doc.rust- lang.org/nightly/reference/attributes/codegen.html) . A function can be inlined in Rust by adding the following attribute to the function definition:
(Optional) Benchmarking: If you have completed the milestone and are looking to optimize your implementation, you can use the criterion crate to benchmark your code. criterion is a statistics-driven micro-benchmarking library that can help you identify performance bottlenecks in your code. You can read more about the criterion crate here (https://bheisler.github.io/criterion.rs/book/index.html) . We have included an optional benchmark in the heapstore crate, and you can run the benchmark by running the following command:
cargo bench -p heapstore
Benchmarking is not required for this milestone, but it will be a requirement in future milestones. Once you have the correctness of your milestone figured out, benchmarking can be a valuable too