CIS 5450 Homework 2: SQL
Due: Wednesday, October 11 2023, 10:00 pm EST
Worth 100 points in total (90 Points Autograded + 10 Points Manually Graded)
Welcome to Homework 2! By now, you should be familiar with the world of data science and the Pandas library. This assignment focuses on helping you get to grips with a new tool: SQL.
Through this homework, we will be working with SQL (specifically pandasql) by exploring a Yelp dataset containing business, checkin, tip, and user data. We will also conduct some text analysis.
We are introducing a lot of new things in this homework, and this is often where students start to get lost. Thus, we strongly encourage you to review the slides/material as you work through this assignment. Before you begin:
Be sure to click “Copy to Drive” to make sure you’re working on your own personal version of the homework
Check the pinned FAQ post on Ed for updates! If you have been stuck, chances are other students have also faced similar problems.
WARNING: You MUST check that your notebook displays ALL visualizations on the Gradescope preview AND verify that the autograder finishes running and gives you your expected score (not a 0). (Ed #251).
Penalty: -10: if we have to resubmit your notebook to Gradescope for you after the deadline (e.g. not naming your files correctly, not submitting both the .py and .ipynb files, etc.).
Penalty: -5: failing to restart and re-run your notebook and ensure that all visualizations show up in the Gradescope preview of your .ipynb (see steps 1 and 2 in the screenshot below for how to check this) (e.g. Large File Hidden Error).
Note: If your plot is not run or not present after we open your notebook, we will deduct the entire manually graded point value of the plot. (e.g. if your plot is worth 4 points, we will deduct 4 points).
Note: If your .py file is hidden because it’s too large, that’s ok! We only care about your .ipynb file.
Part 0: Libraries and Setup (the usual wall of imports)
!pip install penngrader-client
!pip install sqlalchemy==1.4.46
!pip install pandasql
!pip install geopy
!pip install -U kaleido

import os  # needed below for os.environ['HW_ID']
import json
import math
import random
import datetime as dt
from collections import Counter

import pandas as pd
import pandasql as ps  # SQL on pandas DataFrames
import geopy.distance as gp
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib.dates import date2num
import plotly.express as px
import nltk
nltk.download('punkt')
from wordcloud import WordCloud
from penngrader.grader import *

print(pd.__version__)

# Five datasets we're using
!wget -nc https://storage.googleapis.com/penn-cis5450/yelp_business.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/yelp_checkin.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/yelp_tip.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/yelp_user.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/hotel_reviews.csv

PennGrader Setup

%set_env HW_ID=CIS5450_F23_HW2

# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY.
# IF NOT, THE AUTOGRADER WON'T KNOW WHO TO ASSIGN POINTS TO IN OUR BACKEND.
# TODO: YOUR PENN-ID GOES HERE AS AN INTEGER
STUDENT_ID = 99999999

# You should also update this to a unique "secret" just for this homework,
# to authenticate that this is YOUR submission.
SECRET = STUDENT_ID

Leave this cell as-is...

%%writefile notebook-config.yaml
grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, SECRET)
Yelp Dataset
I’m sure you have used Yelp when searching for good restaurants or reliable local services, and with good reason. In this homework, we’ll explore Yelp’s extensive datasets to gain valuable insights into businesses and users on the platform. The data we will be using includes:
Business: data about businesses listed on Yelp
Check-in: data about check-ins made on businesses
Tip: data about tips that users left for each business
User: data about each user on Yelp including friends and number of reviews
We'll be parsing this data into dataframes and relations, and then exploring how to query and assemble the tables into results. We will primarily be using pandasql, but for some of the initial questions we will ask you to perform the same operations in Pandas as well, to familiarize you with the similarities and differences between the two.
For the final part of this homework, we’ll perform some text analysis on a hotel review dataset.
Part 1: Load & Process our Datasets [12 points total]
Before we get into the data, we first need to load and clean our datasets.
You’ll be working with four CSV files:
yelp_business.csv
yelp_checkin.csv
yelp_tip.csv
yelp_user.csv
The file yelp_business.csv contains data of businesses listed on Yelp such as name, address, stars, review count, etc. The file yelp_checkin.csv contains check-in data for businesses including day, hour, and number of people.
The file yelp_tip.csv contains tips that each user left for each business.
The file yelp_user.csv contains data about each user, such as name, review count, friends, membership start date, etc.
TODO:
Load yelp_business.csv and save the data to a dataframe called business_df
Load yelp_checkin.csv and save the data to a dataframe called checkin_df
Load yelp_tip.csv and save the data to a dataframe called tip_df
Load yelp_user.csv and save the data to a dataframe called user_df .
# TODO: Import the datasets to pandas dataframes -- make sure the dataframes are named correctly!
# TODO: view business_df (just the first 5 rows with .head() is fine) to make sure the import was successful
# TODO: view checkin_df (just the first 5 rows with .head() is fine) to make sure the import was successful
# TODO: view tip_df (just the first 5 rows with .head() is fine) to make sure the import was successful
# TODO: view user_df (just the first 5 rows with .head() is fine) to make sure the import was successful
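For reference, the loading step should look roughly like this minimal sketch (it assumes the wget cell above placed the CSVs in the current working directory):

business_df = pd.read_csv('yelp_business.csv')
checkin_df = pd.read_csv('yelp_checkin.csv')
tip_df = pd.read_csv('yelp_tip.csv')
user_df = pd.read_csv('yelp_user.csv')

# Eyeball the first few rows of each import
business_df.head()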
1.1 Data Preprocessing
Next, we are going to want to clean up our dataframes. We will start with cleaning up business_df .
1.1.1 Cleaning business_df [4 points]
.info() gives us meaningful information about the columns, their types, and the number of nulls, based on which we can now clean our dataframe.
Perform these steps and save the results in a new dataframe, business_cleaned_df.
TODO:
Copy business_df to the new dataframe called business_cleaned_df
Remove extra quotation marks from the entries in the columns name and address in business_cleaned_df
Convert the column categories into a list of strings
Drop neighborhood column from business_cleaned_df
Create a column called is_restaurant where the value is 1 if ‘Restaurants’ is in the list in column categories and is 0 otherwise
Create individual rows for each category of each business instead of having a list of categories (Hint: explode)
Sort business_cleaned_df by business_id and categories, in ascending order, and reset the index, dropping the old indices
Create a list called business_category_list containing all unique values found in the column categories, and sort the list in alphabetical order
After performing these steps, business_cleaned_df should have the following schema:
Final Schema:
business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | categories | is_restaurant
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
NOTES: MAKE SURE TO ALWAYS RESET INDEX. OTHERWISE, YOU MIGHT GET AN INTERNAL SERVER ERROR WHEN RUNNING THE TEST CASES!!!
# View info of business_df
business_df.info()
# TODO: Save business_df to business_cleaned_df
business_cleaned_df =
# TODO: Remove extra quotations from name and address
# TODO: Convert categories into a list of strings
# TODO: Drop neighborhood column
# TODO: Create a column called is_restaurant where the value is 1 if 'Restaurants' is within the list in the column categories and is 0 otherwise
# TODO: Create individual rows for each category of each business (Hint: explode)
# TODO: Sort business_id and categories by ascending order
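If you're unsure how these steps chain together, here is one possible sketch. Two assumptions you should verify against a few rows of the raw business_df before relying on them: that the extra quotation marks are literal '"' characters wrapping the values, and that categories is a single ';'-separated string:

business_cleaned_df = business_df.copy()

# Assumption: stray quotes are literal '"' characters wrapping the values
business_cleaned_df['name'] = business_cleaned_df['name'].str.strip('"')
business_cleaned_df['address'] = business_cleaned_df['address'].str.strip('"')

# Assumption: categories are ';'-separated in the raw CSV
business_cleaned_df['categories'] = business_cleaned_df['categories'].str.split(';')

business_cleaned_df = business_cleaned_df.drop(columns=['neighborhood'])

# Flag restaurants BEFORE exploding, while categories is still a list
business_cleaned_df['is_restaurant'] = business_cleaned_df['categories'].apply(
    lambda cats: 1 if isinstance(cats, list) and 'Restaurants' in cats else 0)

# One row per (business, category) pair
business_cleaned_df = business_cleaned_df.explode('categories')
business_cleaned_df = (business_cleaned_df
                       .sort_values(['business_id', 'categories'])
                       .reset_index(drop=True))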
# TODO: Create a list called business_category_list containing all unique values found in the column categories, sorted in alphabetical order
business_category_list =

# 4 points
grader.grade(test_case_id = 'test_cleaning_business', answer = (len(business_category_list), business_cleaned_df.head(5)))

1.1.2 Cleaning checkin_df [3 points]
We will inspect checkin_df through the following steps:
TODO:
Save checkin_df to the new dataframe called checkin_cleaned_df
Inspect the values in the weekday column. Store the number of unique values of weekday in a variable called weekday_count_checkin
Inspect the values in the hour column. Store the number of unique values of hour in a variable called hour_count_checkin
We would like to know the maximum number of check-ins received by a business in the dataframe. Store this value in a variable called max_checkin_single
After performing these steps, checkin_cleaned_df should have the following schema (the same as checkin_df):
Final Schema:
business_id | weekday | hour | checkins
--- | --- | --- | ---
# View info of checkin_df
checkin_df.info()
# TODO: Save checkin_df to checkin_cleaned_df
checkin_cleaned_df =
# TODO: Store the number of unique values for the column weekday in the variable called weekday_count_checkin
weekday_count_checkin =
# TODO: Store the number of unique values for the column hour in the variable called hour_count_checkin
hour_count_checkin =
# TODO: Store the maximum number of check-ins for a single business in the variable called max_checkin_single
max_checkin_single =
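For reference, a sketch of one interpretation. Note that max_checkin_single is ambiguous between the largest single (weekday, hour) value and the largest per-business total; the comment shows the alternative, so check which one the grader expects:

checkin_cleaned_df = checkin_df.copy()

weekday_count_checkin = checkin_cleaned_df['weekday'].nunique()
hour_count_checkin = checkin_cleaned_df['hour'].nunique()

# Largest check-in count in any single row; if the intended reading is the
# largest TOTAL across all rows of one business, use instead:
# checkin_cleaned_df.groupby('business_id')['checkins'].sum().max()
max_checkin_single = checkin_cleaned_df['checkins'].max()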
# 3 points
grader.grade(test_case_id = ‘test_cleaning_checkin’, answer = (weekday_count_checkin, hour_count_checkin, max_checkin_single))
1.1.3 Cleaning tip_df [2 points]
We will clean tip_df through the following steps:
TODO:
Save tip_df to the new dataframe called tip_cleaned_df
Drop all rows with NA
Convert the data type of the date column from string to datetime64[ns] (use the default format '%Y-%m-%d')
Sort tip_cleaned_df by business_id and user_id in ascending order and reset the index
After performing these steps, tip_cleaned_df should have the following schema (the same as tip_df):
Final Schema:
text | date | likes | business_id | user_id
--- | --- | --- | --- | ---

# View info of tip_df
tip_df.info()
# TODO: Save tip_df to tip_cleaned_df

# TODO: Drop rows with NA

# TODO: Convert the data in the date column from string to datetime64[ns]

# TODO: Sort tip_cleaned_df by business_id and user_id in ascending order and reset the index
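For reference, a minimal sketch of these four steps:

tip_cleaned_df = tip_df.copy()
tip_cleaned_df = tip_cleaned_df.dropna()
tip_cleaned_df['date'] = pd.to_datetime(tip_cleaned_df['date'], format='%Y-%m-%d')
tip_cleaned_df = (tip_cleaned_df
                  .sort_values(['business_id', 'user_id'])
                  .reset_index(drop=True))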
# 2 points
grader.grade(test_case_id = 'test_cleaning_tip', answer = (tip_cleaned_df.head(5), len(tip_cleaned_df)))

1.1.4 Cleaning user_df [3 points]
To understand what each column means, please refer to this documentation (note that the original datasets were in JSON format).
Perform these steps and save the results in a new dataframe, user_cleaned_df:
Save user_df to the new dataframe called user_cleaned_df
Create a column called has_friend which contains 1 if the user has at least one friend and 0 otherwise
Drop unwanted columns so we only have the following columns in user_cleaned_df: user_id, name, review_count, yelping_since, friends, elite, average_stars, has_friend
Sort user_cleaned_df by user_id in ascending order and reset the index
After performing these steps, user_cleaned_df should have the following schema:
Final Schema:
user_id | name | review_count | yelping_since | friends | elite | average_stars | has_friend
--- | --- | --- | --- | --- | --- | --- | ---
Do not convert the friends column into a list of strings: the table is very large, and performing such an operation could deplete your RAM. We will address this column later, but only on a smaller subset of the table. The 'friends' column is of type string, and 'None' should not be interpreted as NA but rather as the string 'None'.
# View info of user_df
user_df.info()

# TODO: Save user_df to user_cleaned_df

# TODO: Create a column called has_friend which contains 1 if the user has at least one friend and 0 otherwise

# TODO: Drop unwanted columns

# TODO: Sort by user_id in ascending order and reset the index

# 3 points
grader.grade(test_case_id = 'test_cleaning_user', answer = (user_cleaned_df.head(5), len(user_cleaned_df)))
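If you got stuck above, here is one possible version of this cleaning. It leans on the note above: friends is a plain string, and the string 'None' means the user has no friends:

user_cleaned_df = user_df.copy()

# 'None' (the string) marks users with no friends -- see the note above
user_cleaned_df['has_friend'] = (user_cleaned_df['friends'] != 'None').astype(int)

user_cleaned_df = user_cleaned_df[['user_id', 'name', 'review_count', 'yelping_since',
                                   'friends', 'elite', 'average_stars', 'has_friend']]
user_cleaned_df = user_cleaned_df.sort_values('user_id').reset_index(drop=True)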
1.2 Your Sandbox
.info() is just one of many basic tools that you can use for Exploratory Data Analysis (EDA). Instead of throwing you straight into the deep end, we wanted to give you a chance to take some time and explore the data on your own. This section is not graded, so for the speedrunners out there, feel free to jump ahead; but we wanted to at least give you a small space to use your EDA toolkit to familiarize yourself with all the data you just downloaded.
Some suggestions to get you started:
df.describe()
Series.unique()
# Your EDA here! Feel free to add more cells
Part 2: Exploring the Data with pandasql (and Pandas) [74 points total]
Now that you are familiar (or still unfamiliar) with the dataset, we will introduce you to SQL, or more specifically pandasql: a package created to allow users to query pandas DataFrames with SQL statements.
Notes: In this part, the questions are not necessarily ordered by increasing difficulty, i.e., the first question may not be the easiest, and the difficulty level may vary.
IMPORTANT: Pay VERY CLOSE attention to this style guide!
The typical flow to use pandasql (aliased as ps) is as follows:
1. Write a SQL query in the form of a string
String Syntax: use triple quotes """
Aliases are your friend: if there are very long table names, or you find yourself needing to declare the source (common during join tasks), it's almost always best to alias your tables with short, INTUITIVE names
New Clauses, New Line: each of the main SQL clauses (SELECT, FROM, WHERE, etc.) should begin on a new line
Use Indentation: if there are many components in a single clause, separate them out with new indented lines.
Example below:

SELECT ltn.some_id, SUM(stn.some_value) AS total
FROM long_table_name AS ltn
INNER JOIN short_table_name AS stn
    ON ltn.common_key = stn.common_key
INNER JOIN med_table_name AS mtn
    ON ltn.other_key = mtn.other_key
WHERE ltn.col1 > value
    AND stn.col2 <= another_value
    AND mtn.col3 != something_else
GROUP BY ltn.some_id
ORDER BY total
2. Run the query using ps.sqldf(your_query, locals())
Pandasql is convenient in that it allows you to reference the dataframes that are currently defined in your notebook, so you will be able to fully utilize the dataframes that you have created above!
Given that SQL is a brand-new language for many of you, we wanted to give you a chance to directly compare the similarities and differences between the pandas you already know and the SQL you are about to learn. Thus, for each of the simpler queries, we may ask you to answer the question twice: once with pandas and once with pandasql. The SQL queries may take a minute to run; don't worry, that is normal!
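As a quick sanity check that pandasql is wired up, you can run a toy query like the one below (this is just an illustration, not part of the assignment; it assumes business_df was loaded in Part 1):

demo_query = '''
SELECT name, stars
FROM business_df
LIMIT 5
'''
ps.sqldf(demo_query, locals())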
2.1 Location, Location, Location [22 points]
In the world of real estate, the phrase “location, location, location” has long been a mantra, emphasizing the importance of where a property is situated. Surprisingly, this mantra holds true not just for real estate but also for the culinary world. When it comes to restaurants, location can make or break a business. In this analysis, we’re going to take a deep dive into the intersection of restaurants and their locations.
2.1.1 What are the cities with the most restaurants? [6 points]
The dataframe business_cleaned_df contains information about each business. We want to know the top 10 cities with the most unique restaurants. Note that we can distinguish restaurants by their business_id, i.e., business_id is unique to each restaurant.
TODO:
Use pandas to find the top 10 cities in the city and state columns with the most restaurants. Note that you should also include state in the groupby, since city names could repeat, i.e., city together with state is unique
Include the average stars of restaurants in each city in a column called average_stars
Order the resulting dataframe by the total number of restaurants, restaurant_count, in descending order
Save the result as most_res_df and don't forget to reset the index
Final Schema:
city | state | restaurant_count | average_stars
--- | --- | --- | ---
There are two indicators of whether a business is a restaurant:
is_restaurant equals 1
categories equals 'Restaurants'
Think about how we cleaned business_df to figure out the best way to filter the restaurants!
# TODO: Use pandas to find top 10 cities (including state) with the most restaurants. Also include average stars
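One possible pandas approach (a sketch, not necessarily the only grader-accepted one): filter the exploded table down to the 'Restaurants' rows so each restaurant appears exactly once, then aggregate:

restaurants = business_cleaned_df[business_cleaned_df['categories'] == 'Restaurants']

most_res_df = (restaurants
               .groupby(['city', 'state'])
               .agg(restaurant_count=('business_id', 'nunique'),
                    average_stars=('stars', 'mean'))
               .reset_index()
               .sort_values('restaurant_count', ascending=False)
               .head(10)
               .reset_index(drop=True))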
# 3 points
grader.grade(test_case_id = 'most_res_pd', answer = most_res_df)

Now, use pandasql to find the top 10 cities with the most restaurants
The result should be saved as most_res_df_sql and should be the same as the result from pandas
WARNING: DO NOT USE PANDAS FOR ANY OF THE SQL QUESTIONS! OTHERWISE, YOU WON'T RECEIVE CREDIT FOR THEM
# TODO: Use SQL to find top 10 cities (including state) with the most restaurants. Also include average stars
most_res_query = '''

'''

most_res_df_sql = ps.sqldf(most_res_query, locals())
most_res_df_sql.head()
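If you're stuck, one plausible query body to paste between the triple quotes above (verify its output matches your pandas result before trusting it; COUNT(DISTINCT ...) handles any repeated business rows):

SELECT city, state,
    COUNT(DISTINCT business_id) AS restaurant_count,
    AVG(stars) AS average_stars
FROM business_cleaned_df
WHERE categories = 'Restaurants'
GROUP BY city, state
ORDER BY restaurant_count DESC
LIMIT 10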
# 3 points
grader.grade(test_case_id = 'most_res_sql', answer = (most_res_df_sql, most_res_query))
2.1.2 Which cities have the highest average check-ins per restaurant? [8 points]
The dataframe checkin_cleaned_df contains check-in information for each business. We want to know the top 15 cities with the highest average check-ins per restaurant.
TODO:
Use pandas to find the top 15 cities in the city and state columns with the highest average check-ins per restaurant. Note that you should also include state in the groupby, since city names could repeat, i.e., city together with state is unique
Include average check-ins per restaurant in each city in the column called avg_checkins
Order the resulting dataframe by avg_checkins in descending order
Save the result as most_checkin_df and don't forget to reset the index
Notes:
Only consider businesses with check-in data in checkin_cleaned_df . Do not include businesses with no check-in data in the average.
One business_id can appear in more than one row in business_cleaned_df and checkin_cleaned_df. Think about how we can filter restaurants from other types of business without introducing duplicates.
Final Schema:
city | state | avg_checkins
--- | --- | ---
# TODO: Use pandas to find 15 cities (including state) with highest average checkins per restaurant
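A sketch of one way to structure this in pandas: total the check-ins per business, join against one row per restaurant (filtering on categories == 'Restaurants' avoids duplicates from the exploded table), then average per city:

# Total check-ins per business; businesses absent here are excluded, as required
totals = checkin_cleaned_df.groupby('business_id', as_index=False)['checkins'].sum()

# One row per restaurant
rest = business_cleaned_df.loc[business_cleaned_df['categories'] == 'Restaurants',
                               ['business_id', 'city', 'state']]

most_checkin_df = (rest.merge(totals, on='business_id')
                   .groupby(['city', 'state'], as_index=False)['checkins'].mean()
                   .rename(columns={'checkins': 'avg_checkins'})
                   .sort_values('avg_checkins', ascending=False)
                   .head(15)
                   .reset_index(drop=True))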
# 4 points
grader.grade(test_case_id = 'most_checkin_pd', answer = most_checkin_df)
Now, use pandasql to find the top 15 cities with the highest average check-ins per restaurant
The result should be saved as most_checkin_df_sql and should be the same as the result from pandas
# TODO: Use SQL to find the 15 cities (including state) with the highest average check-ins per restaurant
most_checkin_query = '''

'''

most_checkin_df_sql = ps.sqldf(most_checkin_query, locals())
most_checkin_df_sql.head()

# 4 points
grader.grade(test_case_id = 'most_checkin_sql', answer = (most_checkin_df_sql, most_checkin_query))

2.1.3 Which states have the most popular Thai restaurants? [8 points]
We would like to know which states have the most popular Thai restaurants, which we will gauge using the number of reviews.
TODO:
Use business_cleaned_df to find the Thai restaurants with the highest review_count in each state
The resulting dataframe should include the state and city in which the restaurant is located, the name of the restaurant, the number of reviews in the max_review column, and the stars of the restaurant
If there is more than one Thai restaurant in a state with the maximum number of reviews, include all of them in the resulting dataframe
Save the result as popular_thai_df and don't forget to reset the index
Finally, order the resulting dataframe by max_review in descending order and name in ascending order
Final Schema:
state | city | name | max_review | stars
--- | --- | --- | --- | ---

# TODO: Use pandas to find Thai restaurants with the highest review count for each state
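The per-state maximum with ties kept is a classic pattern; one sketch using groupby + transform (it assumes the category literal is exactly 'Thai' in business_cleaned_df, which you should confirm against business_category_list):

thai = business_cleaned_df[business_cleaned_df['categories'] == 'Thai']

# Keep every row whose review_count equals its state's maximum (ties included)
state_max = thai.groupby('state')['review_count'].transform('max')
popular_thai_df = (thai[thai['review_count'] == state_max]
                   [['state', 'city', 'name', 'review_count', 'stars']]
                   .rename(columns={'review_count': 'max_review'})
                   .sort_values(['max_review', 'name'], ascending=[False, True])
                   .reset_index(drop=True))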
# 4 points
grader.grade(test_case_id = 'popular_thai_pd', answer = popular_thai_df)
Now, use pandasql to find the Thai restaurants with the highest review count for each state
The result should be saved as popular_thai_df_sql and should be the same as the result from pandas
# TODO: Use SQL to find Thai restaurants with highest review count for each state
popular_thai_query = '''

'''

popular_thai_df_sql = ps.sqldf(popular_thai_query, locals())
popular_thai_df_sql.head()
# 4 points
grader.grade(test_case_id = 'popular_thai_sql', answer = (popular_thai_df_sql, popular_thai_query))
2.2 Best Time to Visit Restaurant [14 points]
2.2.1 Finding restaurants [6 points]
After working hard in this course, you were able to find a summer internship position in Phoenix, AZ. You want to explore food places, and you have decided to start with Italian restaurants. We will use business_cleaned_df to find the top 10 Italian restaurants in Phoenix, AZ with the highest review_count.
TODO (pandas):
Filter business_cleaned_df on the given conditions.
Keep only business_id, name, stars, review_count, and categories for the final dataframe
Sort by review_count in descending order
Final Schema:
business_id | name | stars | review_count | categories
--- | --- | --- | --- | ---
# TODO: Finding restaurants
best_italian_df =
# 2 points
grader.grade(test_case_id = 'test_italian', answer = best_italian_df)
TO-DO: Repeat the same using business_cleaned_df (only for this SQL query) and pandasql, saving the result as best_italian_df_sql.

# TODO: Use SQL to find the top 10 Italian restaurants with the highest review count in Phoenix, AZ
best_italian_df_query = '''

'''

best_italian_df_sql = ps.sqldf(best_italian_df_query, locals())
best_italian_df_sql.head()
# 4 points
grader.grade(test_case_id = 'test_italian_sql', answer = (best_italian_df_query,best_italian_df_sql))
2.2.2 Avoid Restaurant Traffic [8 points]
You visited one of the Italian restaurants from the 2.2.1 list, but you weren't able to find a table because the restaurant was very crowded at the time of your visit. Now you want to try Burgers instead, and you want to find a better time to visit so you don't have to wait.
Use business_cleaned_df to find the top 5 Burgers restaurants by review count in Phoenix, AZ. checkin_cleaned_df will be used to find the number of check-ins at a specific day and hour.
Keep only the rows that have fewer than 5 check-ins at a given day and hour
Final Schema:
name | address | city | state | stars | review_count | categories | weekday | hour | checkins
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
# TODO: Avoid Restaurant Traffic
best_burgers_df =
# 3 points
grader.grade(test_case_id = 'test_burgers', answer = best_burgers_df)
Now, use pandasql to find the top 5 Burgers restaurants in Phoenix, AZ with less than 5 check-ins. The result should be saved as best_burgers_sql and should be the same as the result from pandas
# TODO: Use SQL to find the number of check-ins that are less than 5 for
# the top 5 Burgers restaurants with the highest review count
best_burgers_query = '''

'''

best_burgers_sql = ps.sqldf(best_burgers_query, locals())
best_burgers_sql.head()

# 5 points
grader.grade(test_case_id = 'test_burgers_sql', answer = (best_burgers_query, best_burgers_sql))
2.3 Avid Yelpers [22 points]
In this section we'll be taking a deeper dive into user_cleaned_df. We'll be focusing on Yelp users who are particularly active! In this digital age, even Yelp has a friends/follower count. Let's see who's friends with whom, and who reigns supreme on the newest social media platform -- Yelp!
2.3.1 Elite Yelp Reviewers [6 points]
Use user_cleaned_df to find all elite Yelp users who have at least one friend and have left at least 2000 reviews.
TODO:
Filter for the above characteristics
Sort the dataframe by review_count in descending order, then by yelping_since in ascending order
Reset the index
Save the resulting dataframe as elite_user_df
Beware of the datatype of elite when dealing with non-elite users
Final Schema:
user_id | name | review_count | yelping_since | friends | elite | average_stars | has_friend
--- | --- | --- | --- | --- | --- | --- | ---

# TODO: Use pandas to find elite Yelp reviewers
# 2 points
grader.grade(test_case_id = 'elite_user_pd', answer = (elite_user_df.head(), len(elite_user_df)))
Now, use pandasql to find all elite Yelp users who have at least one friend and have left at least 2000 reviews. The result should be saved as elite_user_df_sql and should be the same as the result from pandas
## TODO: Use SQL to find elite yelp reviewers
elite_user_query = '''

'''

elite_user_df_sql = ps.sqldf(elite_user_query, locals())
elite_user_df_sql.head()
# 4 points
grader.grade(test_case_id = 'elite_user_sql', answer = (elite_user_df_sql.head(), elite_user_query, len(elite_user_df_sql)))
2.3.2 Joining Users' Friends [8 points]
We now want to find the friends of all the elite users. Before we do any analysis, let's clean up the friends column.
TODO:
Use elite_user_df and create a new row for each friend an elite user has (hint: split the strings, then explode; make sure there is no trailing whitespace)
This modified elite_user_df will be used for the rest of 2.3.2
# TODO: Explode the friends column of elite_user_df
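A minimal sketch of the split-then-explode idiom (assuming friends is a comma-separated string like 'id1, id2, id3', consistent with the earlier note about this column):

# Assumption: friends look like 'id1, id2, id3'
elite_user_df['friends'] = elite_user_df['friends'].str.split(',')
elite_user_df = elite_user_df.explode('friends')
elite_user_df['friends'] = elite_user_df['friends'].str.strip()  # drop trailing whitespace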
Now we will merge the friends' information in pandas.
Use elite_user_df and user_cleaned_df to join the friends' name and average_stars
Keep the elite user's user_id, name, and average_stars. You can drop all other columns
Rename the friend columns to user_id_friend, name_friend, and average_stars_friend (hint: check out the parameters of merge)
Sort the dataframe by user_id ascending, then by user_id_friend ascending
Save the resulting dataframe as friends_df
Final Schema:
user_id | name | average_stars | user_id_friend | name_friend | average_stars_friend
--- | --- | --- | --- | --- | ---
# TODO: Use pandas to join elite yelp reviewers' friends
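One way to get the _friend columns is a self-join-style merge; the suffixes parameter does the renaming for you. A sketch (double-check the resulting column set against the schema above):

friends_df = (elite_user_df[['user_id', 'name', 'average_stars', 'friends']]
              .merge(user_cleaned_df[['user_id', 'name', 'average_stars']],
                     left_on='friends', right_on='user_id',
                     suffixes=('', '_friend'))
              .drop(columns=['friends'])
              .sort_values(['user_id', 'user_id_friend'])
              .reset_index(drop=True))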
# 2 points
grader.grade(test_case_id = 'friends_join_pd', answer = (friends_df.head(10), len(friends_df)))
Now, use pandasql to merge elite users' friends' information
The result should be saved as friends_df_sql and should be the same as the result from pandas
# TODO: Use SQL to join elite Yelp reviewers' friends
friends_join_query = '''

'''

friends_df_sql = ps.sqldf(friends_join_query, locals())
friends_df_sql.head()

# 4 points
grader.grade(test_case_id = 'friends_join_sql', answer = (friends_df_sql.head(), friends_join_query, len(friends_df_sql)))
2.3.3 Do Friends Think Alike? [8 points]
Now that we joined all of the elite users' friends, we want to compare their average ratings. Let's take a look at the difference between elite users' average stars and the average of all their friends' stars.
TODO:
Use friends_df to calculate the difference between each elite user's average_stars and the average of all of their friends' average_stars (average_stars - average_stars_friend). Name this column average_stars_diff
Save the results in stars_diff_df. Keep only the user_id and average_stars_diff columns.
Final Schema:
user_id | average_stars_diff
--- | ---

# TODO: Use pandas to calculate the difference in average_stars between a user and the average of their friends
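A sketch of the aggregation: average the friends' stars per elite user, then subtract (grouping by both user_id and average_stars works because average_stars is fixed per user):

stars_diff_df = (friends_df
                 .groupby(['user_id', 'average_stars'], as_index=False)['average_stars_friend']
                 .mean())
stars_diff_df['average_stars_diff'] = (stars_diff_df['average_stars']
                                       - stars_diff_df['average_stars_friend'])
stars_diff_df = stars_diff_df[['user_id', 'average_stars_diff']]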
# 2 points
grader.grade(test_case_id = 'stars_diff_pd', answer = stars_diff_df)
Now, use pandasql to calculate the difference in average_stars between a user and their friends
The result should be saved as stars_diff_df_sql and should be the same as the result from pandas
# TODO: Use SQL to calculate the difference in average_stars between a user and their friends
stars_diff_query = '''

'''

stars_diff_df_sql = ps.sqldf(stars_diff_query, locals())
stars_diff_df_sql.head()

# 4 points
grader.grade(test_case_id = 'stars_diff_sql', answer = (stars_diff_df_sql, stars_diff_query))
2.4 Restau
Now, use pandasql to calculate the average ratings of users who leave Yelp reviews on restaurants
The result should be saved as average_rating_tip_df_sql and should be the same as the result from pandas
(Don't worry, this will take a minute.)

Use tip_cleaned_df and business_cleaned_df to find the count of reviews for each category
Keep the categories that have more than 100 reviews
Sort the dataframe by count in descending order
Save the results in category_tip_df
Some categories with the restaurant tag may not seem related -- that's okay.
Take a look at this documentation for help with datetime objects
Final Schema:
categories | count
--- | ---

# TODO: Use pandas to calculate the count of restaurant reviews in January by category