CIS 5450 Homework 1: Data Wrangling and Cleaning (Fall 2023)¶
Due: Wednesday, September 20th, 10:00 PM EST

Hello future data scientists and welcome to CIS 5450! In this homework, you will familiarize yourself with Pandas 🐼! The cutest animal and one of the essential libraries for Data Science. This homework is focused on one of the most important tasks in Data Science, preparing datasets so that they can be analyzed, plotted, used for machine learning models, etc…

This homework will be broken into analyzing several datasets across three sections and a fourth section focusing on XPath!

Working with Amazon Prime Video Data to understand the details behind its movies

Working on merged/joined versions of the datasets (more on this later though).

IMPORTANT NOTE: Before starting, you must click on the “Copy To Drive” option in the top bar. This is the master notebook, so you will not be able to save your changes without copying it! Once you have copied it, make sure you are working on that version of the notebook so that your work is saved.

Run the following 4 cells to set up the notebook.

%set_env HW_ID=CIS5450_F23_HW1

!pip install penngrader-client

from penngrader.grader import *
import pandas as pd
import numpy as np
import seaborn as sns
from string import ascii_letters
import matplotlib.pyplot as plt
import datetime as dt
import requests
from lxml import html
import math
import json

!wget -nc https://storage.googleapis.com/penn-cis5450/credits.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/titles.csv

What is Pandas?¶

Apart from animals, Pandas is a Python library for data manipulation and analysis. It is built on top of NumPy, another Python library that provides efficient computation for arrays, matrices, and other numerical workloads.

Let’s also get familiarized with the PennGrader. It was developed specifically for CIS 545 by a previous TA, Leonardo Murri.

PennGrader was developed to provide students with instant feedback on their answers. You can submit your answer and know whether it’s right or wrong instantly. We then record your most recent answer in our backend database. Let’s try it out! Fill in the cell below with your 8-digit Penn ID and then run the following cell to initialize the grader.

# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY.
# IF NOT, THE AUTOGRADER WON'T KNOW WHO TO ASSIGN POINTS TO IN OUR BACKEND
# YOUR PENN-ID GOES HERE AS AN INTEGER
STUDENT_ID = 99999999

# You should also update this to a unique “secret” just for this homework, to
# authenticate this is YOUR submission
SECRET = STUDENT_ID

Leave this cell as-is…

%%writefile notebook-config.yaml

grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

import os  # needed to read the HW_ID environment variable set above

grader = PennGrader('notebook-config.yaml', os.environ['HW_ID'], STUDENT_ID, SECRET)

Pandas (the animal) are lazy. Their days are made up of eating and sleeping. Just like mine. Let’s run a cell just to make sure PennGrader works.

PennGrader Check [2 points]¶
Change favorite_activity to whichever panda “activity” you prefer. You should assign just one of the activities to the variable. (2 points)

Note: We’ll use cells like these “TODO” above to indicate what is important to have in each section of the notebook. Some general guidelines:

You don’t have to do all of these in one cell/step; we’re just labeling them for each section which might have smaller sub-sections (for example, look at how 1.2 is set up for your reference).
Make sure to read these carefully and do everything that is asked.
Make sure to run all the PennGrader test cells; if we forgot anything, please let us know ASAP on Ed Discussion so that we can update the Markdown cells here.

# In this cell, put which panda activity you prefer in lowercase - either eating or sleeping
# Input activity name in all lowercase
favorite_activity =

# Run this cell to submit to PennGrader!

# [CIS 545 PennGrader Cell] - 2 points
grader.grade(test_case_id = 'panda_test', answer = favorite_activity)

You just had your first experience with the Penn Grader! For the future questions, once you have completed a question, you can submit your answer to the Penn Grader for immediate feedback. Awesome, right?

We will use scores from Penn Grader to determine your grade. You will still need to submit your notebook so we can check for cheating and plagiarism. Do not cheat.

Note: If you run Penn Grader after the due date for any question, your assignment will be marked late, even if you already had full points for the question before the deadline. To remedy this, if you’re going to run your notebook after the deadline, either do not run the grading cells, or reinitialize the grader with an empty or clearly fake ID such as 999999999999 (please use 10+ digits so it is clearly a fake STUDENT_ID).

Adding our data so that our code can find it¶
We can’t be data scientists without data! We provided code for you to download the data (the “wget” cell from earlier). If you go to the panel on the left and click Files, you should see the two downloaded CSV files (credits.csv and titles.csv).

Part 1: Working with Amazon Prime Video Data [42 points]¶
In this part of the homework we will be working with a dataset focused on Amazon Prime Video Movie Data!

1.0 Loading in Titles data (2 points)¶
Let’s first load our dataset into a Pandas Dataframe. Use Pandas’s read_csv functionality, which you can find documentation for here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

While reading documentation is hard at first, we strongly encourage you to get into the habit of doing this, since many times your questions will be answered directly by the documentation (ex: “why isn’t my dataframe dropping duplicates” or “why didn’t this dataframe update”).

Save the Credits dataframe to a variable named: credits_df
Save the Titles dataframe to a variable named: titles_df
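If you haven’t used read_csv before, here is a minimal sketch of the call itself (the filename comes from the wget cell earlier; any extra parameters are up to you):

# A minimal sketch: read a local CSV file into a DataFrame, one call per file.
example_df = pd.read_csv('titles.csv')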

#TODO: Import your two files to pandas dataframes — make sure the dataframes are named correctly!
credits_df =
titles_df =

Let’s focus on the titles_df for now and see what the dataframe looks like. Display the first 10 rows of the dataframe in the cell below (take a look at the documentation to find how to do this!)

#TODO: Display the first 10 rows of `titles_df`

It is also often helpful to inspect the types of each column in a dataframe. Output the types of titles_df in the cell below.

# TODO: Display the datatypes in `titles_df`

Save the types of the type, release_year, runtime, seasons, imdb_id, and tmdb_score columns to a series called titles_df_types (retaining the index names) and pass them into the autograder cell below.
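If you haven’t selected a subset of a Series by its index labels before, here is a minimal sketch on a toy DataFrame (toy_df and its columns are made up for illustration):

# .dtypes returns a Series indexed by column name; indexing with a list of
# labels selects a sub-Series that retains those index names.
toy_df = pd.DataFrame({'a': [1], 'b': ['x'], 'c': [2.0]})
toy_types = toy_df.dtypes[['a', 'c']]
print(toy_types)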

# View the output here!
titles_df_types =

# TEST CASE: titles_df_types (2pt)
# [CIS 545 PennGrader Cell] - 2 points
grader.grade(test_case_id = 'titles_df_types', answer = titles_df_types)

1.1 Cleaning up Titles data (5 points)¶
When you work with data, you’ll encounter NaNs, duplicates, and columns that don’t give much insight into the data. There are different ways to deal with missing values (e.g., imputation, which you can read about on your own), but for now, let’s drop some of these rows in titles_df to clean up our data. Note that there might be multiple ways to do each step. Also note that a lot of the columns in titles_df are entirely null, so be sure to drop the unnecessary columns before filtering out rows with nulls.

Refer to the documentation if you get stuck — it’s your best friend!

TODO: 1.1¶
Keep only the following columns:
id, title, type, release_year, runtime, genres, production_countries, imdb_score, imdb_votes, tmdb_popularity, tmdb_score.
Drop rows that have nulls (e.g. NaN) in them.
Use the info function to see the number of null rows in this DataFrame before dropping, and afterward, to sanity-check that your operation is correct.
Reset the index and drop the index column which stores the original index prior to resetting. We recommend you print out the intermediate dataframe beforehand to see that the indices are not consecutive! (A sketch of these operations on toy data follows this list.)
Cast title and type to type string, and imdb_votes to type int.
Save the result to titles_cleaned_df.
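Here is a minimal sketch of these operations on a toy DataFrame (the names are made up; adapt the pattern to titles_df):

toy = pd.DataFrame({'x': [1.0, None, 3.0], 'y': ['a', 'b', None]})
toy = toy[['x', 'y']]             # keep only the columns you need
toy = toy.dropna()                # drop any row containing a null
toy = toy.reset_index(drop=True)  # reset the index and discard the old one
toy['x'] = toy['x'].astype(int)   # cast a column to a new dtype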

#TODO: Keep only the necessary columns
titles_cleaned_df =

#TODO: Drop nulls

#TODO: Reset and drop the index

#TODO: Cast type

# TEST CASE: titles_cleaned_df (5pt)
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'titles_cleaned_df', answer = titles_cleaned_df)

1.2 Data Wrangling with Titles Data (8 points)¶
Now, let’s process the data in an appropriate format so that we can answer some queries more easily. Make sure to use titles_cleaned_df for this part.

Create a column called is_movie that contains a value of 1 if the type of content is MOVIE and a value of 0 if not.
Create a genres_expanded column and explode it so that each genre of each movie gets its own row. Hint: Make sure the column is the correct type before doing this!
Similarly, create a production_countries_expanded column with individual rows for each country where the movie was produced.
Drop the redundant columns type, genres, and production_countries, as well as all NaN values, saving the result as titles_final_df. Make sure to reset and drop the index as well! (8 points)

Hint: See apply, explode, json.loads, lambda and to_datetime in Python documentation.
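Here is a minimal sketch of the explode pattern on a toy DataFrame, assuming the column holds JSON-encoded lists as strings (check the real column’s format yourself):

toy = pd.DataFrame({'title': ['A', 'B'],
                    'genres': ['["comedy", "drama"]', '["action"]']})
# Parse each JSON string into a real Python list, then explode so that each
# list element gets its own row (the other columns are duplicated).
toy['genres_expanded'] = toy['genres'].apply(json.loads)
toy = toy.explode('genres_expanded')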

#TODO: Create is_movie, genres_expanded and production_countries_expanded

#TODO: Drop redundant columns, nulls, and the index

# TEST CASE: titles_final_df (8pt)
# [CIS 545 PennGrader Cell] - 8 points
grader.grade(test_case_id = 'titles_final_df', answer = titles_final_df)

1.3 Compute the Top Performing Genres¶

1.3.1 Compute the Best Genres By IMDb and TMDb Score (6 points)¶
In this section we will compute the top performing genres, and will use both data from the Internet Movie Database (IMDb) and The Movie Database (TMDb) to do so. We will use titles_final_df in this section.

TODO: 1.3.1

Create a dataframe genres with only the columns genres_expanded, tmdb_popularity, imdb_score and tmdb_score.
Filter genres to only keep those movies with tmdb_popularity greater than 2.0.
Create a dataframe genres_imdb_df that contains the average imdb_score for each genre. Make sure to keep the resulting genres_expanded and imdb_score columns.
Sort this in descending order, keeping only the top 10 values
Create a column called score that is the average score rounded to two decimal places
Reset the index and drop the index column
Return only score and genres_expanded as part of genres_imdb_df
Follow the same steps you used for genres_imdb_df to create genres_tmdb_df, using tmdb_score instead!
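A minimal sketch of the groupby-aggregate-sort pattern on toy data (values are made up):

toy = pd.DataFrame({'genres_expanded': ['comedy', 'comedy', 'drama', 'drama'],
                    'imdb_score': [7.0, 8.0, 6.5, 7.5]})
out = toy.groupby('genres_expanded', as_index=False)['imdb_score'].mean()
out['score'] = out['imdb_score'].round(2)                  # rounded average
out = out.sort_values('score', ascending=False).head(10)   # top 10
out = out.reset_index(drop=True)[['score', 'genres_expanded']]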

#TODO: Create genres

#TODO: Create genres_imdb_df
genres_imdb_df =

#TODO: Create genres_tmdb_df
genres_tmdb_df =

# TEST CASE: genres_df (6pt)
# [CIS 545 PennGrader Cell] - 6 points
grader.grade(test_case_id = 'genres_df', answer = (genres_imdb_df, genres_tmdb_df))

1.3.2 Compute the Percentage Difference Between Genres (4 points)¶
In this section we will compute the differences in results between genres_imdb_df and genres_tmdb_df.

TODO: 1.3.2

Merge genres_imdb_df and genres_tmdb_df on genres_expanded to create merged_df. Use the fact that we want to calculate differences between the results to decide the type of merge you use!
Rename the score columns to score_imdb and score_tmdb respectively
Create a column difference in merged_df, defined as the absolute value of the percentage difference between score_imdb and score_tmdb. Hint: Check out the abs function for help with this!
Use the following formula for this:
\begin{align}
difference = \left| \frac{score_{imdb} - score_{tmdb}}{score_{imdb}} \right| \times 100
\end{align}
Reset the index and drop the index column
Sort merged_df in descending order by difference
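A minimal sketch of the merge and the metric above on toy data (the scores are made up):

left = pd.DataFrame({'genres_expanded': ['comedy', 'drama'], 'score': [7.5, 8.0]})
right = pd.DataFrame({'genres_expanded': ['comedy', 'drama'], 'score': [7.0, 8.4]})
m = left.merge(right, on='genres_expanded', suffixes=('_imdb', '_tmdb'))
# Absolute percentage difference, matching the formula above.
m['difference'] = ((m['score_imdb'] - m['score_tmdb'])
                   / m['score_imdb']).abs() * 100
m = m.reset_index(drop=True).sort_values('difference', ascending=False)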

#TODO: Create merged_df
merged_df =

# TEST CASE: merged_df (4pt)
# [CIS 545 PennGrader Cell] - 4 points
grader.grade(test_case_id = 'merged_df', answer = merged_df)

1.4 Finding Movie Variation By Decade¶
In this section we will compute the performance of movies by decade. We will first use titles_final_df to create titles_intermediate_df containing only unique titles that will be used throughout this section.

Drop genres_expanded and production_countries_expanded to create titles_intermediate_df
Drop duplicate rows
Create a column decade that represents the decade in which the movie was released. For example, if release_year is 1994, decade should be 1990 (see the sketch below for the arithmetic).
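One common way to bucket years into decades is integer arithmetic; a quick sketch:

# Integer division truncates, so multiplying back by 10 snaps each year
# down to the start of its decade.
years = pd.Series([1994, 2003, 2010])
print((years // 10) * 10)   # 1990, 2000, 2010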

#TODO: Create titles_intermediate_df
titles_intermediate_df =

1.4.1 Compute Bottom Movie Decades (5 points)¶
TODO: 1.4.1

See the groupby() function.
Create a dataframe bottom_titles_df with the percentage of movies for each decade. For example, if we have a total of 100 titles in the 1990s and 20 of them are movies, then for the decade 1990 we should see 20.0 in the Percentage column.
Reset the index and drop the index column
Return the five lowest decades by percentage, with columns decade and Percentage
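A minimal sketch of computing a per-group percentage on toy data (all values made up):

toy = pd.DataFrame({'decade': [1990, 1990, 2000, 2000],
                    'is_movie': [1, 0, 1, 1]})
pct = toy.groupby('decade', as_index=False)['is_movie'].mean()
pct['Percentage'] = pct['is_movie'] * 100   # the mean of a 0/1 flag is a share
pct = pct.sort_values('Percentage').head(5).reset_index(drop=True)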

#TODO: Create bottom_titles_df
bottom_titles_df =

# TEST CASE: bottom_titles_df (5pt)
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'bottom_titles_df', answer = bottom_titles_df)

1.4.2 Greatest Shift in Average Runtime (5 points)¶
We will now calculate the greatest shift in average runtime between consecutive decades, as a percentage.

TODO: 1.4.2

Create a dataframe average_runtime_df with the percentage change of average runtime for each decade (with regard to the previous decade).
Sort this by highest percentage_shift first with columns decade, runtime and percentage_shift. Make sure to drop nulls and reset index after!
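pandas has a built-in for percentage change between consecutive rows; a minimal sketch on toy data (assuming the rows are already sorted by decade):

toy = pd.DataFrame({'decade': [1990, 2000, 2010],
                    'runtime': [100.0, 110.0, 99.0]})
toy['percentage_shift'] = toy['runtime'].pct_change() * 100   # vs. previous row
toy = toy.dropna().sort_values('percentage_shift', ascending=False)
toy = toy.reset_index(drop=True)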

#TODO: Create average_runtime_df
average_runtime_df =

# TEST CASE: average_runtime_df (5pt)
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'average_runtime_df', answer = average_runtime_df)

1.4.3 Ratio of Length of Movie Titles to Number of Individuals (7 points)¶
Adam wonders whether movies with longer titles have more people working on them. He decides to see how the ratio between the length of a movie title and the number of people working on the movie has changed over time, and buckets this by decade. We will use credits_df here, and we suggest that you explore the dataset using some of the functions we used in section 1.0. We now aim to answer his question using both titles_intermediate_df and credits_df.

TODO: 1.4.3

Create a dataframe titles_ratio_df with a join of titles_intermediate_df and credits_df that only contains titles in both dataframes
Create a ratio column that stores the ratio of the average number of people working on a film in a decade to the average length of the title (defined as the number of characters in the string) in the decade. The formula for ratio (for a decade) is:
\begin{align}
ratio = \frac{\text{Average Number of People Working on Film}}{\text{Average Title Length}}
\end{align}
Round the ratio to 2 decimal places and sort the dataframe by the ratio column. Store the highest ratio in the variable highest_ratio. The final schema is decade, person_id, title_length, and ratio.
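A minimal sketch of the join-then-derive pattern on toy data (the real computation uses titles_intermediate_df and credits_df, and still needs the per-decade averaging described above):

films = pd.DataFrame({'id': ['t1', 't2'], 'title': ['Heat', 'Casino'],
                      'decade': [1990, 1990]})
people = pd.DataFrame({'id': ['t1', 't1', 't2'], 'person_id': [10, 11, 12]})
joined = films.merge(people, on='id')                # inner join: titles in both
joined['title_length'] = joined['title'].str.len()   # characters per title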

#TODO: Create titles_ratio_df and highest_ratio
titles_ratio_df =
highest_ratio =

# TEST CASE: titles_ratio_df (7pt)
# [CIS 545 PennGrader Cell] - 7 points
grader.grade(test_case_id = 'titles_ratio_df', answer = (titles_ratio_df, highest_ratio))

Part 2: Combining the data [35 points]¶
When you become a full-time data scientist, data will often be spread out across multiple files/tables. The way to combine these tables is through join/merge operations. If you’re familiar with SQL, this will be very familiar to you. If not, don’t worry. I believe in you!

To start, here’s a nice diagram which shows you the different types of joins

A clarifying point: the two Venn diagrams with the “(if Null)” labels are also called Left Outer Join and Right Outer Join.

2.1 TV Shows and Countries¶

2.1.1 IMDB votes per country (4 points)¶

TODO: 2.1.1¶
Using titles_final_df, create a new dataframe called new_titles_final_df which removes the genres_expanded column and drops duplicates.
Next, create intermediate_df which only contains movies with greater than 10,000 IMDB votes
Add a column to intermediate_df called count that is 1 if the IMDB score is >= 6 (and 0 otherwise).
Return a dataframe called country_votes_df counting the number of films with IMDB scores of at least 6 for each country. There should be two columns: country and count. Return this df sorted by count in ascending order, breaking ties by country in descending alphabetical order.

Note: You may receive a warning message, which will not affect your output or your score.

#TODO: Create country_votes_df
country_votes_df =

# TEST CASE: country_votes_df (4pt)
# [CIS 545 PennGrader Cell] - 4 points
grader.grade(test_case_id = 'country_votes_df', answer = country_votes_df)

2.1.2 Most Popular TV Show Actors (5 points)¶
TODO: Create an intermediate dataframe called shows_df, assuming all non-movies are TV shows, containing shows with at least 2000 votes. Merge credits_df with shows_df to obtain the number of IMDB votes each actor has received from all the shows they have been in, only keeping the records that appear in both dataframes. Group by name, and return a dataframe containing the name column and the imdb_votes column.
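A minimal sketch of the merge-then-sum pattern on toy data (names and numbers are made up):

shows = pd.DataFrame({'id': ['s1', 's2'], 'imdb_votes': [5000.0, 3000.0]})
cast = pd.DataFrame({'id': ['s1', 's1', 's2'], 'name': ['Ann', 'Bob', 'Ann']})
merged = cast.merge(shows, on='id')   # inner merge: records in both dataframes
votes = merged.groupby('name', as_index=False)['imdb_votes'].sum()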

#TODO: Create top_actors_df
top_actors_df =

# TEST CASE: top_actors_df (5pt)
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'top_actors_df', answer = top_actors_df)

2.2 Exploring Acting¶
We now want to see which actors are doing well, and we first check Comedy Actors and then the Highest Ranked Actors overall.

2.2.1 Comedy Actors (6 points)¶

Create a new dataframe comedy_actors_df that filters titles_final_df to only contain shows in the comedy genre (once again assuming all non-movies are shows).
Merge again to obtain the actors in comedy shows, only keeping the records that appear in both dataframes
Calculate the average tmdb_popularity of each actor
Create a new column called ranking which assigns the label of “low” if an actor averages less than 10,000 votes, “med” if an actor averages between 10,000 and 100,000 votes, and “high” if an actor averages greater than 100,000 votes.
comedy_actors_df should have columns person_id, name, and imdb_votes
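For bucketed labels like these, np.select is one common pattern; a minimal sketch on a toy Series (the exact boundary handling is for you to decide from the wording above):

avg_votes = pd.Series([5_000, 50_000, 500_000])
conditions = [avg_votes < 10_000, avg_votes <= 100_000]   # checked in order
ranking = np.select(conditions, ['low', 'med'], default='high')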

#TODO: Create comedy_actors_df
comedy_actors_df =

# TEST CASE: comedy_actors_df (6pt)
# [CIS 545 PennGrader Cell] - 6 points
grader.grade(test_case_id = 'comedy_actors_df', answer = comedy_actors_df)

2.2.2 Finding Highest Ranked Actors (6 points)¶
We want to find the actors who have received the “high” ranking for both the comedy and drama genres, without duplicates. Use the same logic as the previous question to obtain the highest-ranked actors for the drama category, and find the actors that appear as highly ranked for both genres. Return this as an alphabetically sorted list called highest_ranked of actor names, ensuring that there are no duplicates.

#TODO: Create highest_ranked
highest_ranked =

# TEST CASE: highest_ranked (6pt)
# [CIS 545 PennGrader Cell] - 6 points
grader.grade(test_case_id = 'highest_ranked', answer = highest_ranked)

2.3 Oscars: Finding the Best of Each Category!¶
Let’s now look at those who did the best in their profession, as well as those who improved the most.

2.3.1 Finding the Best of Each Profession (8 points)¶
Use titles_intermediate_df and credits_df in this question. For each profession in credits_df, do the following:

Find the average imdb_score across all movies a person has appeared in, making sure each person has at least 4 appearances in that profession.
Append the name and score of the person with the highest imdb_score to lists best_role_names and best_role_values respectively. For example, best_role_names = [Martin Scorsese, Leonardo DiCaprio, Meryl Streep] and best_role_values = [8.3, 8.1, 8.6]

Hint: Check the unique function for help with finding the various professions!
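A minimal sketch of looping over a column’s unique values and aggregating within each, on toy data (everything here is made up):

toy = pd.DataFrame({'role': ['actor', 'actor', 'director'],
                    'name': ['Ann', 'Bob', 'Cam'],
                    'imdb_score': [8.0, 7.0, 9.0]})
best_names, best_values = [], []
for role in toy['role'].unique():
    avg = toy[toy['role'] == role].groupby('name')['imdb_score'].mean()
    best_names.append(avg.idxmax())   # name with the highest average score
    best_values.append(avg.max())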

# TODO: Create best_role_names and best_role_values
best_role_names =
best_role_values =

# TEST CASE: best_roles (8pt)
# [CIS 545 PennGrader Cell] - 8 points
grader.grade(test_case_id = 'best_roles', answer = (best_role_names, best_role_values))

2.3.2 Most Improved Individuals (6 points)¶
Use titles_intermediate_df and credits_df in this question. We are now interested in which individuals improved the most between the 2000s and 2010s.

Find the average imdb_score for individuals who have appeared in at least 2 movies in each of the 2000s and 2010s, and compute the difference between the two averages.
Return the 5 individuals with the highest difference (most improvement between the 2010s and 2000s) as best_individuals. For example, best_individuals = [‘Gerard Butler’, ‘Shahid Kapoor’, ‘Jennifer Lawrence’, ‘Leonardo DiCaprio’, ‘Meryl Streep’]

# TODO: Create best_individuals
best_individuals =

# TEST CASE: best_individuals (6pt)
# [CIS 545 PennGrader Cell] - 6 points
grader.grade(test_case_id = 'best_individuals', answer = best_individuals)

Part 3: Correlation Matrix [6 points]¶

3.1 Correlation Matrix (4 + 2 points)¶
Occasionally, there are unexpected correlations in the data. One way to find these correlations is to use a correlation matrix. We suspect that there might be a correlation between the imdb_score, imdb_votes, tmdb_popularity, and tmdb_score. But how strong is the correlation? Also, could there be any correlation between two seemingly uncorrelated features? If there is a correlation, how strong is it?

In this section, we will create the correlation matrix for titles_intermediate_df.

TODO: 3.1¶
Create a dataframe called subset_titles that only contains the following columns from titles_intermediate_df: imdb_score, imdb_votes, tmdb_popularity, tmdb_score.

Generate the correlation matrix. Hint: Read about Pandas “corr()” function.

Name your final correlation matrix dataframe: correlation_matrix (4 points)

Plot a correlation matrix — just to get a sense of what it might look like!
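As a quick illustration of corr() on toy data:

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 5]})
print(toy.corr())   # pairwise Pearson correlations, returned as a DataFrame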

#TODO: Create correlation matrix
correlation_matrix =

# TEST CASE: correlation_matrix (4pt)
# [CIS 545 PennGrader Cell] – 4 points
grader.grade(test_case_id = ‘correlation_matrix’, answer = correlation_matrix)

Here we provide code for you to visualize the correlation matrix. In the following code snippet below, please assign your correlation matrix to the variable named “corr” and then run the cell. You should see a correlation matrix!

sns.set(style = "white")

# Generate a large random dataset (from the seaborn example; not used below)
rs = np.random.RandomState(33)
d = pd.DataFrame(data=rs.normal(size=(100, 26)),
                 columns=list(ascii_letters[26:]))

# Compute the correlation matrix
# ASSIGN THE "corr" VARIABLE TO YOUR CORRELATION MATRIX
corr = correlation_matrix

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title("Correlation Heatmap")
plt.show()

Part 4: XPath [15 points]¶

So far, we’ve looked at Amazon Prime film data. Let’s change our scope to look at the most successful films of all time! The datasets we provide are compiled for us on Kaggle, but sometimes we need to create our own datasets! We’ll do some web scraping and work with web-based data!

On wikipedia.org, we have a table with highest-grossing film data.

We get the DOM tree for you below. Recall that the DOM tree is just a tree made up of objects, which are elements, attributes, or text nodes (and a few other things we’ll ignore). Any XML or HTML document can be parsed to build a DOM tree.

Key XPath Concept Review¶
We use XPath to match tree patterns against DOM trees.

XPath has a few main ideas:

Navigation over structure

Child element by position: $node$[i] returns the $i$th child of $node$
Note that this is 1-based, i.e., $node$[1] is the first child

Child element by name ($node$/step). Note you can combine the above 2 ideas, e.g., $node$/step[2] is the 2nd step child

Child attributes ($node$/@attr)
Child text values ($node$/text())
We can generalize each of the above by replacing a single / with a //, which finds matches that are children or descendants, e.g., $node$//text() would return any text content within $node$

Predicates

$node$[ test ] evaluates whether test is satisfied by $node$, e.g., $node$[a] means there exists at least one a child
e.g., $node$[@attr="b"] means there exists an attr attribute with value b

In this question, we’ll build incrementally to this. We’ll attach nodes to DOM trees, and match XPaths against these.

Recall that XPaths return ordered node sets, or in Python this really means they return lists of nodes from the document, in the order in which the nodes appeared, without duplicates.

We will use the syntax:

node_set = variable_representing_node.xpath("…")

to get an output node set from an initial DOM node. We can also:

for node in node_set:
    print(node.xpath("…"))

or the like.
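Here is a tiny self-contained example of these ideas (the HTML snippet is made up for illustration):

snippet = html.fromstring(
    "<table><tr><td>Avatar</td></tr><tr><td>Titanic</td></tr></table>")
# //tr/td/text() matches the text children of every td under any tr,
# returned as an ordered Python list.
print(snippet.xpath("//tr/td/text()"))   # ['Avatar', 'Titanic']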

Here’s a helper function…

## Simple pretty-printer, from https://stackoverflow.com/questions/5086922/python-pretty-xml-printer-with-lxml

from typing import Optional

import lxml.etree

def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
    space = "    "
    indent_str = "\n" + level * space

    element.text = strip_or_null(element.text)
    if element.text:
        element.text = f"{indent_str}{space}{element.text}"

    num_children = len(element)
    if num_children:
        element.text = f"{element.text or ''}{indent_str}{space}"

        for index, child in enumerate(element.iterchildren()):
            is_last = index == num_children - 1
            indent_lxml(child, level + 1, is_last)

    elif element.text:
        element.text += indent_str

    tail_level = max(0, level - 1) if is_last_child else level
    tail_indent = "\n" + tail_level * space
    tail = strip_or_null(element.tail)
    element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent


def strip_or_null(text: Optional[str]) -> Optional[str]:
    if text is not None:
        return text.strip() or None
Let’s see it in action, first to view the entire HTML document.

# Request the data and build the DOM tree (we’ve done this for you!)
# w = requests.get(“https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season”)
# dom_tree = html.fromstring(w.content)
# print(dom_tree)

# indent_lxml(dom_tree) # corrects indentation “in place”

# result = lxml.etree.tostring(dom_tree, encoding=”unicode”)
# print(result)

w = requests.get("https://en.wikipedia.org/wiki/List_of_highest-grossing_films")
dom_tree = html.fromstring(w.content)
print(dom_tree)

indent_lxml(dom_tree)  # corrects indentation "in place"

result = lxml.etree.tostring(dom_tree, encoding="unicode")
print(result)

4.1: Update dom_tree to get only the highest grossing films in 2022 adjusted for inflation¶

Note the webpage has multiple tables. To look at the contents of all of these tables, we can use the XPath //table/*, which matches every child element of each table.

for node in dom_tree.xpath("//table/*"):
    result = lxml.etree.tostring(node, encoding="unicode")
    print('**')
    print(result)

We are only interested in the table with the highest grossing films adjusted for inflation (second table).

Create updated_dom_tree to contain the tbody element of the movie table.

Hint: What does each index of the above xpath search contain?
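As a sketch of the indexing involved (treat the index below as an assumption to verify against the live page, not the answer):

# xpath() returns a Python list, so a plain list index selects one table,
# and a further xpath step can then reach elements inside it.
tables = dom_tree.xpath("//table")
maybe_second_table = tables[1]   # hypothetical position of the target table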

updated_dom_tree =

result = lxml.etree.tostring(updated_dom_tree, encoding="unicode")
print(result)

4.2 Film Names (3 points)¶
TODO: Create x_path_film_name and film_names.

x_path_film_name should be the value you pass in for updated_dom_tree.xpath() to retrieve names of the films.

It should be in the form of ‘/…/text()’ (may vary slightly; only include the actual parameter passed to updated_dom_tree.xpath!).

Hint: Since we updated the DOM tree to start at the table, we don’t need to begin with ‘/…/table’; instead, start the XPath with the next element after ‘table’ that you want.

You can use the browser’s ‘inspect’ tool on the website to see the various HTML tags and attributes, and use them to figure out how to build the XPath for the table!

Your job is to descend several more steps to the text content of one column of each row, building up the entire x_path_film_name string.

Return film_names which is a list with all the film names.

# TODO: Define the xpath string

x_path_film_name =

film_names =
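As a generic, hypothetical pattern only (the real steps depend on the table’s markup, which you should confirm with the inspect tool):

# Hypothetical shape: walk each row, then a cell, then an anchor's text.
# Verify every step against the actual markup before using it.
example_xpath = "tr/td/a/text()"
example_names = updated_dom_tree.xpath(example_xpath)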