CMSC 23500 Section 1 2 (Rehman) Lecture 7 – Indexing I

Introduction to Database Systems
CMSC 23500 Section 1/2 (Rehman) Lecture 7 – Indexing I
Slides from Aaron Elmore, Andy Pavlo and Silberchatz et. al.

Agenda and Updates
• Database Indexes
• ModelingDataAccessCosts
• BasicsofDataIndexing,Hashes
• Reading:Chapter14.1–14.2,14.5
• CrustyDB 1: Page Milestone is Out
• Due Friday, 19th April at 11:59 am (Noon)
• Rust Primer 2 is no longer a graded assignment
• Update your Calendars
• Will be an ungraded tutorial released along with CrustyDB 2
• Final Grade Breakdown will be Announced Soon

• Tuples Stored in Pages Stored in Files
• Database keeps track of the pages used for each table
• Storing Tuples within Pages:
• Fixed Length: Can be stored and accessed w/ fixed byte offsets
• Variable Length: Slotted Page architecture
• Organizing Tuples: • Sequential
• Heapfile

• We need a way to model how expensive it is to read a page, find a record, etc.
• Allows the Database to evaluate query plans
• We will use a simple cost model that only accounts for I/O latency (time). It does not account for CPU, memory transfer, if the page is in memory, etc.
• Going to account for:
• Time to seek to a page Ts
• Time to transfer a page from disk T
⋈ 𝜎!”#$%$&&
Index Scan

How to get the costs
• How do you get Ts and TT? • What about SSD or NVMe?

Cost Continued
• If we wanted to read one random page what would the cost be? (think of random meaning you do not know where cursor/pointer/head is at)

Cost Continued
• If we wanted to read p random pages what would the cost be? p(Ts + TT)

Cost Continued
• If we wanted to read a file with p pages sequentially what would the cost be?
Ts + p(TT)

Need a fast way to get to data
What is a RecordID?
Find a record where Attr0 = Min
Attr0 Attr1…
0.0 Bob 143
0.1 Adam 449
0.2 Jenny 457
0.3 Katie 454
Attr0 Attr1…
1.0 Gau 123
1.1 Joy 249
1.2 Anna 551
1.3 Min 44

Need a fast way to get to data
Create an index on a search key.
Index pages are made of index entries (search key and pointer).
Pointer is PK or RID.
Idx Page.0
Attr0 Attr1…
0.0 Bob 143
0.1 Adam 449
0.2 Jenny 457
0.3 Katie 454
Attr0 Attr1…
1.0 Gau 123
1.1 Joy 249
1.2 Anna 551
1.3 Min 44
Idx Page.1

Need a fast way to get to data
We can search the index for the value.
Why is this faster?
Idx Page.0
Attr0 Attr1…
0.0 Bob 143
0.1 Adam 449
0.2 Jenny 457
0.3 Katie 454
Attr0 Attr1…
1.0 Gau 123
1.1 Joy 249
1.2 Anna 551
1.3 Min 44
Idx Page.1

Real-Life Analog

Basic Concepts
• Index files are typically much smaller than the original file
• Two basic kinds of indices:
• Ordered indices: search keys are stored in sorted order
• Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Index Evaluation Metrics
• Different Index Types support different types of Access efficiently. • Records with a specified value in the attribute
• Records with an attribute value falling in a specified range of values.
• Index Metrics: • Access time
• Insertion time
• Deletion time
• Space overhead

Ordered Indices
• In an ordered index, index entries are sorted on the search key value.
• Clustering index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
• Also called the primary index
• The search key of a primary index is usually but not necessarily the primary key.
• Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index (or unclustered).
• Index-sequential file: a sequential file ordered using a search key with a clustering index on the search key.

Dense Index Files
• Dense index — An index record appears for every search-key value in the file.
• E.g. index on ID attribute of instructor relation
Srinivasan
Comp. Sci.
Comp. Sci.
Comp. Sci.
Elec. Eng.

Dense Index Files (Cont.)
• Dense index on dept_name, with instructor file sorted on dept_name
Srinivasan
Comp. Sci.
Comp. Sci.
Comp. Sci.
Elec. Eng.
Comp. Sci.
Elec. Eng.

Sparse Index Files
• Sparse Index: contains index records for only some search-key values. • Applicable when records are sequentially ordered on search-key (eg clustered)
• To locate a record with search-key value K we:
• Find index record with largest search-key value < K • Search file sequentially starting at the record to which the index record points Srinivasan Comp. Sci. Comp. Sci. Comp. Sci. Elec. Eng. Secondary Indices Example • Secondary index on salary field of instructor Srinivasan Comp. Sci. Comp. Sci. Comp. Sci. Elec. Eng. • Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. • Secondary indices have to be dense Covering Index Idx Page.0 Attr0 Attr1... Adam 0.1 449 0.0 Bob 143 Anna 1.2 551 Bob 0.0 143 0.1 Adam 449 David 4.3 50 Gau 1.0 123 0.2 Jenny 457 0.3 Katie 454 Attr0 Attr1... 1.0 Gau 123 1.1 Joy 249 1.2 Anna 551 1.3 Min 44 Idx Page.1 Katie 0.3 454 Kami 8.2 987 Hao 4.0 578 Jenny 0.2 457 Joy 1.1 249 Problem #1 • Suppose: • We can hold 100 index entries per index page • Index pages are 4 kilobytes (kb) = 4000 bytes. • There are 100,000,000 tuples in the table we are indexing. • What is the size of a dense index in gigabytes (gb) assuming each page is fully filled? Problem #2 • Suppose: • We can hold 100 index entries per index page. • Index pages are 4 kilobytes (kb) = 4000 bytes. • 10ms to read a random block (tT+tS) • Dense sorted index built on 1,000,000 tuples • How many (binary) searches per second can you do on the index to find the RID (round down)? Programming Help
How could we do this faster?

What do you use for O(1) lookups?

Hash Function Basics
Maps values from a (potentially large or infinitely sized) domain to a smaller, fixed sized range.
The output values are indexes to individual “buckets”.
How does a hash function guarantee fixed output range?

Hash Based Indexes
H(David) = 0 H(Mo) = 1
Attr0 Attr1…
1.0 Gau 123
1.1 Joy 249
1.2 Anna 551
1.3 Min 44
Idx Page.1
Output Hash Buckets
Idx Page.0
Attr0 Attr1…
0.0 Bob 143
0.1 Adam 449
0.2 Jenny 457
0.3 Katie 454

Programming Help, Add QQ: 749389476
Hash Collisions
• When trying to insert a value in a hash bucket
• What do you do when your hash bucket is full? (Collision)
H1(Aaron) = 1 Idx.insert((“Aaron, 3.1), H(Aaron))
Many Options:
1. Chaining
2. Probing
3. Rehashing
4. Alternate Hashing Schemes…
Output Hash Buckets
Idx Page.1
Idx Page.0

• Create additional pages “chained” to the first one for a specific hash bucket
H1(Aaron) = 1 Idx.insert((“Aaron, 3.1), H(Aaron))
Idx Page.1
Idx Page 1.1
Output Hash Buckets
Idx Page.0

• Attempt to stick the record in the next available slot at offset 𝑖
• Linear Probing
• The next slots are explored as a
linear function of 𝑖
• Quadratic Probing
• The next slots are explored as a quadratic function of 𝑖
Output Hash Buckets

• If your chains are too long or index pages are filled to some size
• Abandon your existing hash function and reconstruct the index using a new hash function
• Typically double the hash bucket range when rehashing
Idx Page.0
Idx Page.1
Output Hash Buckets (H2)

How expensive is it?
• To find a record?
• Iftheindexkeyisunique?
• FindtherecordvsfindtheRID/pointer.
• To insert a record?

Indexes in PostgreSQL
Without an index
123 Awe Ave
‘addresses’ table
σ “#$ &’$ “###
Sequential Scan

Indexes in PostgreSQL
123 Awe Ave
σ “#$ &’$ “###
Index Scan on
IX_Addresses_id
With an ordered index
IX_Addresses_
‘addresses’ table

Indexes in PostgreSQL
With a hash index
IX_Addresses_city
Chicago 0.1
Hash Index Lookup on
IX_Addresses_city
123 Awe Ave
IX_Addresses_city
Mumbai 0.2
Zurich 5.3
‘ADDRESSES’ TABLE

Indexes in PostgreSQL
H(‘ZQ%’) = ??
With a hash index
IX_Addresses_city
Chicago 0.1
σ (&)* .&/0 ,-%
Sequential Scan
123 Awe Ave
IX_Addresses_city
Mumbai 0.2
Zurich 5.3
‘ADDRESSES’ TABLE

• Searching through pages can be expensive, especially when not stored in order
• Indexes can be built to find records faster
• Different types of indexes
• Indexes built to speed up certain types of lookups
• Single value lookups can be sped up using hash indices
• Range lookups can be sped up by ordered indices – next lecture

程序代写 CS代考加QQ: 749389476
Next Lecture
• Database Indexes II
• Readings: Database System Concepts: Chapter 14.3