2023 COMP 4446 / 5046 Assignment 1
Assignment 1 is an individual assessment. Please note the University's academic dishonesty and plagiarism policy.
Submission Deadline: Friday, March 17th, 2023, 11:59pm
Submit via Canvas:
Your notebook
Run all cells before saving the notebook, so we can see your output
In this assignment, we will explore ways to predict the length of a Wikipedia article based on the first 100 tokens in the article. Such a model could be used to explore whether there are systematic biases in the types of articles that get more detail.
If you are working in another programming language, please make sure to clearly indicate which part of your code runs which section of the assignment, and produce output that provides all necessary information. Submit your code, example outputs, and instructions for executing it.
Note: This assignment contains topics that are not covered at the time of release. Each section has information about which lectures and/or labs covered the relevant material. We are releasing it now so you can (1) start working on some parts early, and (2) know what will be in the assignment when you attend the relevant labs and lectures.
TODO: Copy and Name this File
Make a copy of this notebook in your own Google Drive (File -> Save a Copy in Drive) and change the filename, replacing YOUR-UNIKEY with your own unikey. For example, for a person with unikey mcol1997, the filename should be:
COMP-4446-5046_Assignment1_mcol1997.ipynb
If there is something you want to tell the marker about your submission, please mention it here.
[write here – optional]
Data Download [DO NOT MODIFY THIS]
We have already constructed a dataset for you using a recent dump of data from Wikipedia. Both the training and test datasets are provided in the form of csv files (training_data.csv, test_data.csv) and can be downloaded from Google Drive using the code below. Each row of the data contains:
The length of the article
The title of the article
The first 100 tokens of the article
In case you are curious, we constructed this dataset as follows:
Downloaded a recent dump of English Wikipedia.
Ran WikiExtractor to get the contents of the pages.
Filtered out very short pages.
Ran SpaCy with the en_core_web_lg model to tokenise the pages (Note, SpaCy’s development is led by an alumnus of USyd!).
Counted the tokens and saved the relevant data in the format described above.
This code will download the data. DO NOT MODIFY IT
## DO NOT MODIFY THIS CODE
# Code to download files into Colaboratory
# Install the PyDrive library
!pip install -U -q PyDrive
# Import libraries for accessing Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Function to read the file, save it on the machine this colab is running on, and then read it in
import csv
def read_file(file_id, filename):
    downloaded = drive.CreateFile({'id': file_id})
    downloaded.GetContentFile(filename)
    with open(filename) as src:
        reader = csv.reader(src)
        data = [r for r in reader]
    return data
# Calls to get the data
# If you need to access the data directly (e.g., you are running experiments on a local machine), use these links:
# - Training, https://drive.google.com/file/d/1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W/view?usp=share_link
# - Dev, https://drive.google.com/file/d/1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN/view?usp=share_link
# - Test, https://drive.google.com/file/d/1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB/view?usp=share_link
training_data = read_file('1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W', "/content/training_data.csv")
dev_data = read_file('1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN', "/content/dev_data.csv")
test_data = read_file('1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB', "/content/test_data.csv")
print("------------------------------------")
print("Size of training data: {0}".format(len(training_data)))
print("Size of development data: {0}".format(len(dev_data)))
print("Size of test data: {0}".format(len(test_data)))
print("------------------------------------")
print("------------------------------------")
print("Sample Data")
print("LABEL: {0} / SENTENCE: {1}".format(training_data[0][0], training_data[0][1:]))
print("------------------------------------")
# Preview of the data in the csv file, which has three columns:
# (1) length of article, (2) title of the article, (3) first 100 words in the article
for v in training_data[:10]:
    print("{}\n{}\n{}\n".format(v[0], v[1], v[2][:100] + "..."))
# Store the data in lists and modify the length value to be in [0, 1]
training_lengths = [min(1.0, int(r[0])/10000) for r in training_data]
training_text = [r[2] for r in training_data]
dev_lengths = [min(1.0, int(r[0])/10000) for r in dev_data]
dev_text = [r[2] for r in dev_data]
test_lengths = [min(1.0, int(r[0])/10000) for r in test_data]
test_text = [r[2] for r in test_data]
------------------------------------
Size of training data: 9859
Size of development data: 994
Size of test data: 991
------------------------------------
------------------------------------
Sample Data
LABEL: 6453 / SENTENCE: ['Anarchism', 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy , typically including , though not necessarily limited to , governments , nation states , and capitalism . Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations . As a historically left - wing movement , usually placed on the farthest left of the political spectrum , it is usually described alongside communalism and libertarian Marxism as the libertarian wing ( libertarian socialism )']
------------------------------------
Anarchism is a political philosophy and movement that is skeptical of all justifications for authori…
Albedo (; ) is the measure of the diffuse reflection of solar radiation out of the total solar radia…
A , or a , is the first letter and the first vowel of the Latin alphabet , used in the modern Englis…
Alabama ( ) is a state in the Southeastern region of the United States , bordered by Tennessee to th…
In Greek mythology , Achilles ( ) or Achilleus ( ) was a hero of the Trojan War , the greatest of al…
Abraham Lincoln
Abraham Lincoln ( ; February 12 , 1809 – April 15 , 1865 ) was an American lawyer , politician , a…
Aristotle (; ” Aristotélēs ” , ; 384–322 BC ) was a Greek philosopher and polymath during the Clas…
An American in Paris
An American in Paris is a jazz – influenced orchestral piece by American composer George Gershwin fi…
Academy Award for Best Production Design
The Academy Award for Best Production Design recognizes achievement for art direction in film . The …
Academy Awards
The Academy Awards , better known as the Oscars , are awards for artistic and technical merit for th…
1 – Predicting article length from initial content
This section relates to content from the week 1 lecture and the week 2 lab.
In this section, you will implement training and evaluation of a linear model (as seen in the week 2 lab) to predict the length of a Wikipedia article from its first 100 words. You will represent the text using a Bag of Words model (as seen in the week 1 lecture).
1.1 Word Mapping [2pt]
In the code block below, write code to go through the training data and for any word that occurs at least 10 times:
Assign it a unique ID (consecutive, starting at 0)
Place it in a dictionary that maps from the word to the ID
# Your code goes here
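If you are unsure where to start, here is a minimal sketch of one possible approach. It assumes the tokens in each entry of training_text are space-separated (as produced by the tokenisation step described in the data section); this is a sketch, not the required solution.
from collections import Counter

# Count how often each token appears across the training texts.
token_counts = Counter()
for text in training_text:
    token_counts.update(text.split())

# Assign consecutive IDs, starting at 0, to tokens seen at least 10 times.
word_to_id = {}
for word, count in token_counts.items():
    if count >= 10:
        word_to_id[word] = len(word_to_id)

print("Vocabulary size:", len(word_to_id))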
1.2 Data to Bag-of-Words Tensors [2pt]
In the code block below, write code to prepare the data as PyTorch tensors.
The text should be converted to a bag of words (i.e., a vector whose length is the vocabulary size from the previous step, containing the counts of the words in the text).
# Your code goes here
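One possible shape for this step, reusing the word_to_id dictionary sketched in 1.1 (again a sketch, not the only valid layout):
import torch

def to_bow(texts, word_to_id):
    # One row per document, one column per vocabulary word, holding counts.
    bow = torch.zeros(len(texts), len(word_to_id))
    for i, text in enumerate(texts):
        for word in text.split():
            if word in word_to_id:
                bow[i, word_to_id[word]] += 1
    return bow

train_x = to_bow(training_text, word_to_id)
train_y = torch.tensor(training_lengths).unsqueeze(1)
dev_x = to_bow(dev_text, word_to_id)
dev_y = torch.tensor(dev_lengths).unsqueeze(1)
test_x = to_bow(test_text, word_to_id)
test_y = torch.tensor(test_lengths).unsqueeze(1)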
1.3 Model Creation [2pt]
Construct a linear model with an SGD optimiser (we recommend a learning rate of 1e-4) and mean squared error as the loss.
# Your code goes here
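For reference, a minimal sketch of the recommended setup:
import torch

# A single linear layer mapping bag-of-words counts to a length in [0, 1].
model = torch.nn.Linear(len(word_to_id), 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()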
1.4 Training [2pt]
Write a loop to train your model for 100 epochs, printing performance on the dev set every 10 epochs.
# Your code goes here
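A sketch of one possible loop, using full-batch updates (mini-batching is equally valid) and the tensors sketched in 1.2:
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        # Report dev-set loss without tracking gradients.
        with torch.no_grad():
            dev_loss = loss_fn(model(dev_x), dev_y)
        print("Epoch {}: train loss {:.4f}, dev loss {:.4f}".format(
            epoch + 1, loss.item(), dev_loss.item()))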
1.5 Measure Accuracy [2pt]
In the code block below, write code to evaluate your model on the test set.
# Your code goes here
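Evaluation mirrors the dev-set check above; a minimal sketch using the test tensors from 1.2:
# Mean squared error on the held-out test set.
with torch.no_grad():
    test_loss = loss_fn(model(test_x), test_y)
print("Test MSE: {:.4f}".format(test_loss.item()))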
1.6 Analyse the Model [2pt]
In the code block below, write code to identify the 50 words with the highest weights and the 50 words with the lowest weights.
# Your code goes here
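Since the model is a single linear layer, each vocabulary word has exactly one weight, so sorting the weight vector recovers the extremes. A sketch assuming the model and word_to_id from the earlier sketches:
# Invert the word -> ID mapping so IDs can be turned back into words.
id_to_word = {i: w for w, i in word_to_id.items()}

weights = model.weight.detach().squeeze()  # shape: (vocabulary size,)
order = weights.argsort()                  # indices sorted by ascending weight

print("50 lowest-weighted words:", [id_to_word[i.item()] for i in order[:50]])
print("50 highest-weighted words:", [id_to_word[i.item()] for i in order[-50:]])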
2 – Compare Data Storage Methods
This section relates to content from the week 1 lecture and the week 2 lab.
Implement a variant of the model with a sparse tensor for your input bag of words (see https://pytorch.org/docs/stable/sparse.html for how to convert a tensor to sparse). Use the default sparse format (COO).
# Your code goes here
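One workable design, sketched under the assumption that the dense tensors from 1.2 are in scope. Because support for sparse inputs to torch.nn.Linear varies across PyTorch versions, this sketch applies the linear map by hand with torch.sparse.mm:
import torch

# Convert the dense bag-of-words matrices to sparse COO format.
train_x_sparse = train_x.to_sparse()
dev_x_sparse = dev_x.to_sparse()
test_x_sparse = test_x.to_sparse()

# Linear model parameters, managed directly rather than via torch.nn.Linear.
weight = torch.zeros(len(word_to_id), 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
sparse_optimizer = torch.optim.SGD([weight, bias], lr=1e-4)

def sparse_forward(x_sparse):
    # torch.sparse.mm multiplies a sparse matrix by a dense one.
    return torch.sparse.mm(x_sparse, weight) + bias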
2.1 Training and Test Speed [2pt]
Compare the time it takes to train and test the new model with the time it takes to train and test the old model.
You can time the execution of a line of code using %time.
See this guide for more on timing.
# Example of timing
%time print("hi!")
CPU times: user 89 µs, sys: 867 µs, total: 956 µs
Wall time: 963 µs
# Your code goes here
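For example, the forward passes of the two models could be compared as below (assuming the dense model from section 1 and the sparse pieces sketched above; the training loops can be timed the same way):
%time model(test_x)
%time sparse_forward(test_x_sparse)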
3 – Switch to Word Embeddings
This section relates to content from the week 2 lecture and the week 3 lab.
In this section, you will implement a model based on word2vec.
Use word2vec to learn embeddings for the words in your data.
Represent each input document as the average of the word vectors for the words it contains.
Train a linear regression model.
# Your code goes here
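A minimal sketch using gensim's Word2Vec, assuming space-separated tokens as before; the hyperparameters here are illustrative, not required:
from gensim.models import Word2Vec
import torch

# Train word2vec on the tokenised training texts.
sentences = [text.split() for text in training_text]
w2v = Word2Vec(sentences, vector_size=100, min_count=10)

def to_avg_embedding(texts, w2v):
    # Represent each document as the mean of its in-vocabulary word vectors.
    rows = []
    for text in texts:
        words = [w for w in text.split() if w in w2v.wv]
        if words:
            rows.append(torch.tensor(w2v.wv[words].mean(axis=0)))
        else:
            rows.append(torch.zeros(w2v.vector_size))
    return torch.stack(rows)

train_x_w2v = to_avg_embedding(training_text, w2v)
test_x_w2v = to_avg_embedding(test_text, w2v)

# The regression model mirrors sections 1.3 and 1.4, with the embedding
# size as the input dimension.
w2v_model = torch.nn.Linear(w2v.vector_size, 1)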
3.1 Accuracy [1pt]
Calculate the accuracy of your model.
# Your code goes here
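As in section 1.5, mean squared error on the test set is one reasonable measure; a sketch assuming the pieces above and that w2v_model has been trained:
with torch.no_grad():
    print("Test MSE: {:.4f}".format(loss_fn(w2v_model(test_x_w2v), test_y).item()))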
3.2 Speed [1pt]
Calculate how long it takes to evaluate your model.
# Your code goes here
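The %time magic from section 2.1 works here as well, e.g.:
%time w2v_model(test_x_w2v)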
4 – Open-Ended Improvement
This section relates to content from the week 1, 2, and 3 lectures and the week 1, 2, and 3 labs.
This section is an open-ended opportunity to find ways to make your model more accurate and/or faster (e.g., use WordNet to generalise words, try different word features, other optimisers, etc).
We encourage you to try several ideas to provide scope for comparisons.
If none of your ideas work you can still get full marks for this section. You just need to justify the ideas and discuss why they may not have improved performance.
4.1 Ideas and Motivation [1pt]
In this box, describe your ideas and why you think they will improve accuracy and/or speed.
Your answer goes here
4.2 Implementation [2pt]
Implement your ideas.
# Your code goes here
4.3 Evaluation [1pt]
Evaluate the speed and accuracy of the model with your ideas.
# Your code goes here
In this text box, briefly describe the results. Did your improvement work? Why / Why not?
Your answer goes here