CS 124 Homework 5: Spring 2023
Your name: Jeffrey Wang, Anthony Cui, Chao Cheng Collaborators: An unnamed deity
No. of late days used on previous psets: 0
No. of late days used after including this pset: 0
Homework is due Wednesday April 5 at 11:59pm ET. You are allowed up to twelve (college)/forty (extension school) late days in total, but the number of late days you take on each assignment must be a nonnegative integer at most two (college)/four (extension school).
Try to make your answers as clear and concise as possible; style may count in your grades. Assignments must be submitted in pdf format on Gradescope. If you do assignments by hand, you will need to scan your papers to turn them in.
You may not write anything that you will submit within one hour of collaborating with other students, language models, the internet, etc., or using notes from such sources. That is, whatever you submit must first have been in your head alone, or notes produced by you alone, for an hour. Subject to that, you can collaborate with other students, e.g. in brainstorming and thinking through approaches to problem-solving.
For all homework problems where you are asked to give an algorithm, you must prove the correctness of your algorithm and establish the best upper bound that you can give for the running time. Generally better running times will get better credit; generally exponential-time algorithms (unless specifically asked for) will receive no or little credit. You should always write a clear informal description of your algorithm in English. You may also write pseudocode if you feel your informal explanation requires more precision and detail, but keep in mind pseudocode does NOT substitute for an explanation. Answers that consist solely of pseudocode will receive little or no credit. Again, try to make your answers clear and concise.
1. PhyllisNotes Incorporated sells PhyllisNotesTM subscriptions to CS124 students at $78.50 per semester. Andrew, the CTO, runs the online platform where subscribed users can view PhyllisNotesTM. Each student can log in to their PhyllisNotesTM account using only a password (Andrew has not discovered the concept of usernames). To combat intellectual property theft, Andrew has implemented a strict policy where each student can only log in from their own laptop. This policy is enforced via the use of a hash table, which hashes students' passwords and stores the IP address of the laptop logging in with that password. Students who log in to their PhyllisNotesTM accounts from someone else's laptop will immediately have their accounts closed for violating the terms of service. Each entry of Andrew's hash table stores (at most) one IP address; in the case of hash collisions, however, Andrew's design contains a major flaw (not surprising for someone with no college degree1). If Student A creates and logs into an account using a password that hashes to the same value as the password of
1This problem was written by Andrew.
Student B, the website will mistake Student A for Student B, believe that Student B logged in from Student A’s laptop, and close Student B’s account. Assume that each student enters their own account using only their own laptop, and every student whose account has been wrongfully closed will file a lawsuit against PhyllisNotes Incorporated.
(a) (10 points) There are 262 students in CS124, exactly half of whom have indicated in a survey that they will subscribe to PhyllisNotesTM. How large must Andrew's hash table be so that the probability that PhyllisNotes Incorporated gets sued is at most 1%? (You'll likely want to use a calculator or program.)
(b) (10 points) Unfortunately, as nearly all of the company's data storage capacity is being used to store and back up PDFs of PhyllisNotesTM, Andrew is unable to build a large enough hash table as required in part (a). Instead, he will need to estimate the number of lawsuits that will be filed. Suppose there are m students this year who will subscribe to PhyllisNotesTM, and the hash table has n slots. Calculate the expected number of lawsuits that will be filed this year, as a function of m and n.
(a) This question is similar to the birthday problem. PhyllisNotes gets sued if and only if a hash collision occurs among the 131 subscribers (half of 262). The probability that this happens for a hash table of size n, assuming the outwardly-random and deterministic behavior of hash functions makes hash values effectively uniform and independent, is equal to one minus the probability that no collision occurs.
The first student will not sue Phyllis with probability one. The second student's password avoids the first student's hash value with probability (n − 1)/n. Continuing, the probability that no student will have a password collision is:
1 · (n − 1)/n · (n − 2)/n · · · (n − 130)/n
To find how large Andrew's hash table must be so that the probability that PhyllisNotes gets sued is at most 1%, we can find how small n can be such that the probability that no collisions occur is at least 1 − 0.01 = 0.99.
Python yields 847279 as the minimum table size.
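The search can be reproduced with a short script; this is a sketch (function names are our own), using a binary search, which works because the no-collision probability is increasing in the table size:

```python
def p_no_collision(n, m=131):
    """Probability that m = 131 subscribers' hashes into n slots are all distinct."""
    p = 1.0
    for i in range(m):
        p *= (n - i) / n
    return p

def min_table_size(m=131, target=0.99):
    """Smallest n with P(no collision) >= target; binary search is valid
    because p_no_collision is increasing in n."""
    lo, hi = m, 10**8
    while lo < hi:
        mid = (lo + hi) // 2
        if p_no_collision(mid, m) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_table_size())
```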
(b) There are many ways of thinking about this question. Here’s a clean one:
Every student will file a lawsuit except the first person hashed to their slot, so the number of lawsuits is m minus the number of occupied slots. Taking complements, the number of occupied slots is n minus the number of empty slots.
Let I1, . . . , In be indicator variables for whether each slot is empty. The expectation of an indicator variable is the probability that it's 1, so E[I1] = P(slot 1 is empty) = (1 − 1/n)^m. Then by linearity, the expected number of empty slots E[∑_j Ij] is n(1 − 1/n)^m, so the expected number of lawsuits is
m − n + n(1 − 1/n)^m
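The closed form can be sanity-checked by simulation; a quick sketch (function names are our own):

```python
import random

def expected_lawsuits(m, n):
    """Closed form derived above: m - n + n(1 - 1/n)^m."""
    return m - n + n * (1 - 1 / n) ** m

def simulated_lawsuits(m, n, trials=100_000, seed=124):
    """Monte Carlo check: lawsuits = m minus the number of occupied slots."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        occupied = {rng.randrange(n) for _ in range(m)}
        total += m - len(occupied)
    return total / trials

print(expected_lawsuits(10, 5))   # ~5.5369 for m=10, n=5
print(simulated_lawsuits(10, 5))  # should agree to within sampling error
```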
Students may have an alternate solution: Let the m students represent balls, and the n table slots represent bins. For some x ≥ 1, if x balls land in any given bin, there will be x − 1 lawsuits (in the sequential process described in the question). As such, to find the expected number of lawsuits, we can compute the expected number of bins with exactly 2 balls, times 1, plus the expected number of bins with exactly 3 balls, times 2, etc.:
∑_{i=2}^{m} E[number of bins with exactly i balls] · (i − 1)
The expectation of an indicator is equal to the probability that it is 1. With linearity of expectation, we have that the expected number of bins with exactly two balls is equal to n · Pr[bin 1 has exactly two balls], which is equal to n · C(m, 2) · (1/n)^2 · ((n − 1)/n)^(m−2).
Generalizing, we have that our sum above is equivalent to:
∑_{i=2}^{m} n · C(m, i) · (1/n)^i · (1 − 1/n)^(m−i) · (i − 1)
As we can check on Desmos, this is identical to the expression m − n + n(1 − 1/n)^m from the first solution!
2. Suppose each person gets an independent uniformly random hash value from the range [1 . . . n]. (For the case of birthdays, n would be 365.)
(a) (10 points) Show that for some constant c1, when there are at least c1·√n people in a room, the probability that no two have the same hash value is at most 1/2. (Note that the hints below may be relevant for any part of the problem.)
(b) (10 points) Similarly, show that for some constant c2 (and sufficiently large n), when there are at most c2·√n people in the room, the probability that no two have the same hash value is at least 1/2.
(c) (0 points, optional)2 Make these constants as close to optimal as possible.
Hints: You may wish to use the fact that
(1 − 1/n)^n < 1/e < (1 − 1/n)^(n−1)
for all n > 1. You may want to also use the facts that
e^(−x) ≥ 1 − x    and    e^(−x−x^2) ≤ 1 − x for x ≤ 1/2.
You may feel free to find and use better bounds.
2We won't use this question for grades. Try it if you're interested. It may be used for recommendations/TF hiring.
(a) Suppose that there are k people in the room. Then the probability of no two people hashing to the same value is
1 · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n)
Using the hint that 1 − x ≤ e^(−x), we have
1 · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n) ≤ 1 · e^(−1/n) · e^(−2/n) · · · e^(−(k−1)/n) = e^(−(1+2+···+(k−1))/n) = e^(−k(k−1)/(2n))
If we choose k = 2√n, then, since n ≥ 1 gives k − 1 = 2√n − 1 ≥ √n, we have k(k − 1) ≥ (2√n)(√n) = 2n, so e^(−k(k−1)/(2n)) ≤ e^(−1) < 1/2,
as desired, so c1 = 2 suffices.
(b) For this part, we go back to the earlier expression for the probability of no two people (out of k) having the same hash:
1 · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n)
Using the other hint that 1 − x ≥ e^(−x−x^2), we have
1 · (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n) ≥ 1 · e^(−1/n−(1/n)^2) · e^(−2/n−(2/n)^2) · · · e^(−(k−1)/n−((k−1)/n)^2)
≥ e^(−k(k−1)/(2n) − (k−1)k(2k−1)/(6n^2))
Since k = O(√n) (and we only care, in this part, about asymptotics for large n), the second term in the exponent vanishes, so the above expression is asymptotically e^(−k(k−1)/(2n)) ≥ e^(−k^2/(2n)).
If we choose any c2 < √(2 ln 2), then for k < c2·√n, that expression is at least e^(−c2^2/2) > e^(−ln 2) = 1/2, as desired.
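For intuition, a quick numeric check (our own sketch) of where the no-collision probability crosses 1/2 for n = 365, against the constants from parts (a) and (b):

```python
import math

def p_all_distinct(k, n):
    """Probability that k independent uniform hashes into [1..n] are all distinct."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p

n = 365
# smallest k whose collision probability exceeds 1/2 (the birthday threshold)
k_star = next(k for k in range(1, n) if p_all_distinct(k, n) < 0.5)
print(k_star)
print(math.sqrt(2 * math.log(2)) * math.sqrt(n))  # part (b) lower bound, c2 ~ 1.177
print(2 * math.sqrt(n))                           # part (a) upper bound, c1 = 2
```

The threshold lands between the two bounds, consistent with c2·√n < k* ≤ c1·√n.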
3. (a) (5 points) In our description of Bloom filters in class, we could only add and look up items. Suppose we try to implement deletion of an item from a Bloom filter by changing the corresponding elements to 0s. Give an example sequence of operations (adds, deletes, and lookups) for which this Bloom filter has both a false positive and a false negative.
(b) (15 points) Counting Bloom filters are a variant of Bloom filters where you do not use an array of bits, but an array of small counters. Instead of changing 0s to 1s when an element hashes to a Bloom filter location, you increment the counter. To delete an element, you decrement the counter. To check if an element is in the set, you just check if all of the counter entries are bigger than 0; if they are, then you return that the element is in the set. If you see a 0, you say that the element is not in the set. Deletions will work properly for a Counting Bloom filter as long as the counters never overflow; once a counter overflows, you would not have an accurate count of the number of elements currently hashed to that location. A standard choice in this setting in practice is to use 4-bit counters. Find an expression for the probability that a specific counter overflows when using 4-bit counters when n elements are hashed into m total locations using k hash functions. (You may have a summation in your expression.) Calculate this probability when m/n = 8 and k = 5, for n = 1000, 10000, and 100000, and calculate the expected number of counters that overflow for each case.
(c) (5 points) Suppose that you use m counters for n elements and k hash functions for a Counting Bloom filter. What is the false positive probability (assuming there has never been an overflow), and how does it compare with the standard Bloom filter?
(a) Consider the simple Bloom filter with m = 3 slots and k = 2 hash functions. Now let's define the three elements a, b, and c which have the hash values h(a) = [0,1], h(b) = [0,2], and h(c) = [1,2]. Then the following operations will give both a false positive and a false negative:

Operation    Table state
add(a)       [1, 1, 0]
add(b)       [1, 1, 1]
lookup(c)    [1, 1, 1]
delete(b)    [0, 1, 0]
lookup(a)    [0, 1, 0]

False positive: lookup(c) will say c is in the table when it isn't. False negative: lookup(a) will say a is not in the table when it is.
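The sequence above can be replayed in a few lines; a minimal sketch with the example's hash values hard-coded:

```python
# A 3-slot, 2-hash Bloom filter with the flawed "clear the bits" deletion.
# The fixed hash values below are the ones from the example above.
HASHES = {"a": [0, 1], "b": [0, 2], "c": [1, 2]}
bits = [0, 0, 0]

def add(x):
    for i in HASHES[x]:
        bits[i] = 1

def delete(x):  # flawed: may clear bits that other stored elements rely on
    for i in HASHES[x]:
        bits[i] = 0

def lookup(x):
    return all(bits[i] for i in HASHES[x])

add("a")
add("b")
print(lookup("c"))  # True: false positive, c was never added
delete("b")
print(lookup("a"))  # False: false negative, a was added and never deleted
```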
(b) Interpretation 1: Using one hash table
A 4-bit counter overflows if at least 16 of the nk total hashes performed are mapped to that counter. Since each hash value is chosen uniformly and independently, the probability that exactly j hashes are mapped to a counter i is
C(nk, j) · (1/m)^j · ((m−1)/m)^(nk−j)
and so the probability that i overflows is
P(counter i overflows) = ∑_{j=16}^{nk} C(nk, j) · (1/m)^j · ((m−1)/m)^(nk−j)
To keep this summation manageable, we can reduce the number of terms from nk − 15 to 16 by rewriting it in terms of the complement, getting
P(counter i overflows) = 1 − ∑_{j=0}^{15} C(nk, j) · (1/m)^j · ((m−1)/m)^(nk−j)
To find the expected number of counters that overflow, we can set up indicator variables I1, . . . , Im to represent whether the ith counter overflows. Then by linearity of expectation,
E(overflowed counters) = E(∑_{i=1}^{m} Ii) = ∑_{i=1}^{m} E(Ii) = m · E(I1) = m · P(counter 1 overflows)
= m · (1 − ∑_{j=0}^{15} C(nk, j) · (1/m)^j · ((m−1)/m)^(nk−j))
These formulas give the following values:

n                         1,000           10,000          100,000
P(counter i overflows)    1.408 × 10^-17  1.436 × 10^-17  1.439 × 10^-17
E(overflowed counters)    1.126 × 10^-13  1.149 × 10^-12  1.151 × 10^-11
Interpretation 2: Using k hash tables
Much of the reasoning is the same as the first interpretation. The only differences are that a counter can only have at most n hashes mapped to it (one per element), and that each hash now can only take on m/k different values. This makes the probability and expected value
P(counter i overflows) = 1 − ∑_{j=0}^{15} C(n, j) · (1/(m/k))^j · ((m/k − 1)/(m/k))^(n−j)
= 1 − ∑_{j=0}^{15} C(n, j) · (k/m)^j · (1 − k/m)^(n−j)
E(overflowed counters) = m · (1 − ∑_{j=0}^{15} C(n, j) · (k/m)^j · (1 − k/m)^(n−j))
These formulas also give slightly different values:

n                         1,000           10,000          100,000
P(counter i overflows)    1.288 × 10^-17  1.424 × 10^-17  1.438 × 10^-17
E(overflowed counters)    1.030 × 10^-13  1.139 × 10^-12  1.150 × 10^-11
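These values can be reproduced numerically; one subtlety is that probabilities this small (~10^-17) vanish entirely if computed as 1 minus a floating-point CDF, so the sketch below (helper name is our own) sums the upper tail directly in log space:

```python
from math import exp, lgamma, log

def p_tail(trials, p, threshold=16, terms=60):
    """P(Binomial(trials, p) >= threshold), summing the tiny upper tail directly.
    Computing 1 - CDF in floating point would lose a ~1e-17 tail to rounding."""
    def log_pmf(j):
        return (lgamma(trials + 1) - lgamma(j + 1) - lgamma(trials - j + 1)
                + j * log(p) + (trials - j) * log(1 - p))
    hi = min(trials, threshold + terms)  # terms beyond this are negligible
    return sum(exp(log_pmf(j)) for j in range(threshold, hi + 1))

n, k = 1000, 5
m = 8 * n
p1 = p_tail(n * k, 1 / m)  # interpretation 1: one table with m counters
p2 = p_tail(n, k / m)      # interpretation 2: k tables of m/k counters each
print(p1, m * p1)
print(p2, m * p2)
```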
(c) We can note that the only thing lookup(i) cares about is whether the relevant counters are zero or not. This means that if there has never been an overflow, the Counting Bloom filter behaves the exact same as the normal Bloom filter, so its false positive probability is the standard
P(false positive) ≈ (1 − e^(−kn/m))^k
The tradeoff is space: each slot holds a small counter (e.g., 4 bits) instead of a single bit.
4. For the document similarity scheme described in class, it would be better to store fewer bytes per document. Here is one way to do this, that uses just 48 bytes per document: take an original sketch of a document, using 84 different permutations. Divide the 84 permutations into 6 groups of 14. Re-hash each group of 14 values to get 6 new 64 bit values. Call this the super-sketch. Note that for each of the 6 values in the super-sketch, two documents will agree on a value when they agree on all 14 of the corresponding values in the sketch.
(a) (5 points) Why does it make sense to simply assume that this is the only time a match will occur?
(b) (15 points) Consider the probability that two documents with resemblance r agree on two or more of the six sketches. Write equations that give this probability and graph the probability as a function of r. Explain and discuss your results.
(c) (10 points) What happens if instead of using a 64 bit hash value for each group in the supersketch, we only use a 16 bit hash? An 8 bit hash?
(a) Two documents will agree on a super-sketch value (i.e. "match") if either a) they share the same 14 numbers in the underlying sketch or b) a hash collision occurs. Since the probability of the latter is 2^(−64) ≈ 0 for a 64-bit hash function, we can assume for now that a super-sketch match occurs if and only if the underlying 14 values also match.
(b) The probability that two documents with resemblance r agree on each permutation is r, so the probability that they agree on any given group of 14 permutations is r^14. Let m be the number of sketch groups that match. Then the probability of exactly m groups matching is
P(m) = C(6, m) · (r^14)^m · (1 − r^14)^(6−m)
and the probability of two or more groups matching is
P(m ≥ 2) = 1 − P(1) − P(0)
= 1 − C(6, 1) · r^14 · (1 − r^14)^5 − C(6, 0) · (1 − r^14)^6
= 1 − 6r^14 (1 − r^14)^5 − (1 − r^14)^6
Graphing P(m ≥ 2) as a function of r, we see that, as desired, the probability of two or more matches is extremely low until it increases very sharply at roughly r = 0.90. Thus, the documents won't be reported as agreeing unless the resemblance is already very high, which is what we want. This is pretty similar to what happens under the normal sketch scheme, but uses much less space at the cost of slightly lower reliability (e.g. for a document similarity of r = 0.8, the super-sketch scheme will incorrectly report a match with probability about 0.0258, whereas the original scheme of 100 permutations reports a 90-permutation match with probability about 0.006). Whether greater accuracy or more space-efficient storage is better depends on the use case; consider, for instance, the task of caching the sketches of every single webpage on the internet.
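The curve is easy to tabulate; a short sketch (function name is our own) for the collision-free (64-bit) case:

```python
from math import comb

def p_two_or_more(r, groups=6, group_size=14):
    """P(two or more of the six super-sketch values agree) at resemblance r,
    treating 64-bit hash collisions as impossible."""
    p = r ** group_size  # a group matches iff all 14 of its sketch values agree
    return sum(comb(groups, m) * p**m * (1 - p)**(groups - m)
               for m in range(2, groups + 1))

for r in (0.5, 0.8, 0.9, 0.95):
    print(r, p_two_or_more(r))
```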
(c) Above, we assumed that the probability of a hash collision occurring when calculating the hashes of the sketch groups was negligible. However, for 16-bit and 8-bit hashes, the chance of a collision is much greater, so we need to factor it into the probability expression from part (b). For a b-bit hash function, the chance of collision is 2^(−b), so the probability that two documents match on a given sketch group is now r^14 + 2^(−b)(1 − r^14). Thus, the probability of m groups matching becomes
P_b(m) = C(6, m) · (r^14 + 2^(−b)(1 − r^14))^m · ((1 − r^14)(1 − 2^(−b)))^(6−m)
and the probability of two or more groups matching is then
P_b(m ≥ 2) = 1 − 6 (r^14 + 2^(−b)(1 − r^14)) ((1 − r^14)(1 − 2^(−b)))^5 − ((1 − r^14)(1 − 2^(−b)))^6
A table of probabilities for P_b(m ≥ 2) is shown below for 8-bit, 16-bit, and 64-bit hash functions across a range of r values.

r        0.10      0.50      0.75     0.90     0.95     0.99
8-bit    2.3e-4    2.3e-4    0.0066   0.4224   0.8806   0.9998
16-bit   —         8.7e-8    0.0045   0.4151   0.8786   0.9998
64-bit   ≈0        5.6e-8    0.0045   0.4151   0.8786   0.9998
These are very similar, so we’d probably be happy to save a factor of 8 on space at the cost of a slight increase in error rate.
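The table entries follow from the formula above; a sketch (function name is our own) that folds the b-bit collision probability into the group-match probability:

```python
from math import comb

def p_two_or_more_b(r, b, groups=6, group_size=14):
    """As in part (b), but a group can also 'match' via a b-bit hash collision."""
    p = r**group_size + 2**-b * (1 - r**group_size)  # true match or collision
    return sum(comb(groups, m) * p**m * (1 - p)**(groups - m)
               for m in range(2, groups + 1))

for b in (8, 16, 64):
    print(b, [p_two_or_more_b(r, b) for r in (0.10, 0.50, 0.90)])
```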
5. Recall the fingerprinting algorithm presented in lecture 15 for determining whether a pattern of length k appears in a document of length n characters, each a number between 0 and 9: we hash the pattern by taking the decimal number it represents mod p; we hash each substring of length k of the document by taking the decimal number it represents mod p, and, for every hash match, we compare the corresponding place in the document to the pattern, character by character, stopping if we find an instance of the pattern. For instance, comparing [3,1,4,1,5] to [3,1,4,1,5] takes 5 comparisons of characters (and then we can stop the algorithm, having found an occurrence of the pattern), and comparing [3,1,4,1,5] to [3,1,4,2,8] takes 4 comparisons of characters ([3,1,4,x]) (but we must move on to checking any other hash matches, in case one of them is a match).
Instead of choosing a random prime number for p, say that we use some fixed prime number p that is made public. Sara knows p and wants to slow down your algorithm, so she is trying to construct a document and pattern that force you to do many comparisons of characters.
(a) (5 points) Suppose p = 1249, the pattern P is [3,1,4,1,5,9,2,6] (so k = 8), and the document D is [1,4,6,8,3,4,8,3,1,9,4,1,6,7,8,3,9,3,3,1,4,1,7,1,7,5,1,1,3,2] (so n = 30). The algorithm makes 11 comparisons of characters. What are those comparisons?
(b) (12 points) Give an upper bound on the number of comparisons of characters that the algorithm performs in terms of the length n of the document, the length k ≤ n of the pattern, and the prime p.
For full credit, your bound should be asymptotically tight for large n: that is, there should exist a constant c such that if n is large enough (where “large enough” may depend on k and p), your answers to this and the next question should differ by at most a factor of c. Correctly-proven bounds that don’t quite match are eligible for partial credit.
(c) (12 points) Give a family of examples (each consisting of a document, a pattern, and p) that require asymptotically as many comparisons of characters as possible.
(d) (0 points, optional)3 Make your answers to the previous two questions asymptotically tight in terms of p, n, and k: that is, there should exist a constant c such that for all sufficiently large n, k, and p your answers are within a factor c of each other. For instance, an upper bound of O(n) and examples achieving Ω(n − k) would satisfy the requirements for full credit, but not for this optional problem.
For full credit, your bound should be tight: that is, your answers to the above questions should asymptotically match and be correct. Correctly-proven bounds that don’t quite match are eligible for partial credit.
(a) The pattern's hash is 31415926 mod 1249 = 1078. Six substrings of the document share this hash value, and comparing each against the pattern character by character gives a total of 11 comparisons:

Substring (1-indexed start)   First mismatch               Comparisons
46834831 (position 2)         4 ≠ 3 at character 1         1
34831941 (position 5)         4 ≠ 1 at character 2         2
19416783 (position 9)         1 ≠ 3 at character 1         1
78393314 (position 14)        7 ≠ 3 at character 1         1
31417175 (position 19)        7 ≠ 5 at character 5         5
71751132 (position 23)        7 ≠ 3 at character 1         1

No substring fully matches, so the algorithm checks every hash match and performs 1 + 2 + 1 + 1 + 5 + 1 = 11 comparisons.
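The count can be verified with a short script (our own sketch; it recomputes each window's hash directly rather than rolling, for clarity):

```python
def count_comparisons(doc, pattern, p):
    """Count the character comparisons made by the fingerprinting algorithm.
    Returns (total comparisons, 0-indexed hash-match positions, found?)."""
    k = len(pattern)
    target = int("".join(map(str, pattern))) % p
    total, match_positions = 0, []
    for i in range(len(doc) - k + 1):
        window = doc[i:i + k]
        if int("".join(map(str, window))) % p != target:
            continue  # hash mismatch: no characters compared
        match_positions.append(i)
        for j in range(k):
            total += 1  # one character comparison (the mismatch counts too)
            if window[j] != pattern[j]:
                break
        else:
            return total, match_positions, True  # occurrence found: stop
    return total, match_positions, False

D = [1,4,6,8,3,4,8,3,1,9,4,1,6,7,8,3,9,3,3,1,4,1,7,1,7,5,1,1,3,2]
P = [3,1,4,1,5,9,2,6]
print(count_comparisons(D, P, 1249))  # 11 comparisons across 6 hash matches
```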
(b) The algorithm hashes each of the n − k + 1 substrings of length k of the document. Each hash match triggers at most k character comparisons (and a full match ends the algorithm), so the algorithm performs at most (n − k + 1) · k comparisons in total. Since this bound accounts for every comparison that could occur for any n, k, and p, it is an upper bound. This is O(nk).
(c) Consider the prime p = 2, the pattern consisting of k − 1 0s followed by a 2, and a document consisting of n 0s.
Then the pattern is ≡ 0 mod 2 and every substring of length k is ≡ 0 mod 2, so we must check every substring. Each substring matches the pattern in the first k − 1 digits and then mismatches on the kth, so every check makes exactly k comparisons and never finds a full match. Hence we do (n − k + 1) · k comparisons, asymptotically matching the bound in the previous part.
3We won't use this question for grades. Try it if you're interested. It may be used for recommendations/TF hiring.
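The lower-bound family can be checked with the same comparison counter (redefined here so the block is self-contained), on hypothetical sizes n = 200, k = 10:

```python
def count_comparisons(doc, pattern, p):
    """Character comparisons made by the fingerprinting algorithm (naive hashing)."""
    k = len(pattern)
    target = int("".join(map(str, pattern))) % p
    total = 0
    for i in range(len(doc) - k + 1):
        window = doc[i:i + k]
        if int("".join(map(str, window))) % p != target:
            continue
        for j in range(k):
            total += 1
            if window[j] != pattern[j]:
                break
        else:
            return total  # full match: algorithm stops

    return total

n, k = 200, 10
doc = [0] * n                  # n zeros
pattern = [0] * (k - 1) + [2]  # k-1 zeros then a 2; its value 2 is 0 mod 2
print(count_comparisons(doc, pattern, 2), (n - k + 1) * k)  # both 1910
```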