Assignment 2: Recurrent neural networks¶
Due: by 11:59pm, Wednesday 6/7¶
Note: late submissions are accepted up to 5 days after the deadline for this assignment. You will receive 4 free late days to use between the two assignments of the course; once you exhaust those late days, each (partial) day that a submission is late will lower the maximum grade you can receive for the assignment by 10%. If you submit this assignment more than 5 days late, you will receive no credit.

Before submitting, please click the Kernel menu at the top of JupyterLab and select the Restart Kernel and Run All Cells… button, then click the red Restart button on the box that pops up. This ensures that your code will run on its own, without relying on information that you may have added and then removed when developing it. Please note that some parts of the code may be time-consuming to run. You should plan accordingly so that you don’t start rerunning the code too close to the deadline.

To submit your assignment, first make sure you have saved it, by pressing Ctrl + S or Cmd + S or clicking on the disk icon at the top of the notebook. Then use git / GitHub Desktop to add your completed assignment file(s) to git tracking, commit them to your repository with the message Final submission, and push them to GitHub. We recommend going through the process of submitting sufficiently early that you can come to office hours to get help if needed; you can submit multiple times, and only your final submission (with the commit message Final submission) will be graded.

Please answer this question prior to submitting¶

Did you consult anyone other than the TA/instructor, or any resources other than those listed on GauchoSpace, for this assignment? If so, please list them below.

It is fine to consult classmates and external resources, as long as the work you submit is your own. We would just like to know who you consulted, or what other resources you might have found helpful.

Double-click on this text cell to enter your response

Part 1: RNNs in PyTorch [50 pts]¶
In this part of the assignment, you’ll learn how to create recurrent neural networks in PyTorch, and you’ll show your conceptual understanding by exploring and explaining key design decisions.

Here are some other tutorials that explore aspects of RNNs we aren’t able to fit in this assignment:

NLP From Scratch: Classifying Names with a Character-Level RNN
Generating spam from a character-level RNN
NLP From Scratch: Generating Names with a Character-Level RNN
Seq2seq learning with neural networks
NLP From Scratch: Translation with a Sequence to Sequence Network and Attention

To begin with, run the cell below to import everything you’ll need. It will also create several functions and variables from Assignment 1 that we’ll be reusing here.

import torch
from torch import nn, optim, tensor
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader, Dataset
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vectors
import pandas as pd
import time

pd.set_option('display.max_colwidth', None)

# ========================
# LOADING DATA - GENERAL

class TSVDataset(Dataset):

    def __init__(self, filepath):
        """Loads the data from a provided filepath"""
        self.data = list()
        with open(filepath, encoding="utf-8") as in_file:
            for line in in_file:
                (label, text) = line.strip().split("\t")
                self.data.append((label, text))

    def __getitem__(self, idx):
        """Returns the datapoint at a given index"""
        return self.data[idx]

    def __len__(self):
        """Returns the number of datapoints in the dataset"""
        return len(self.data)

spacy_tokenizer = get_tokenizer('spacy', language="en_core_web_sm")
tokenizer = lambda text: [token.lower() for token in spacy_tokenizer(text)]

def text_to_indices(text):
    tokens = tokenizer(text)
    indices = vocab(tokens)
    return torch.tensor(indices, dtype=torch.int64)

def label_to_index(label):
    return int(label == "pos")

def data_to_indices(data):
    (label, text) = data
    return (label_to_index(label), text_to_indices(text))

train_data = TSVDataset("inputs/imdb-train.tsv")
test_data = TSVDataset("inputs/imdb-test.tsv")

# =================================================
# SETTING UP THE VOCAB AND EMBEDDINGS - GENERAL

def yield_tokens(data):
    """A generator for tokenizing text in a (label, text) pair"""
    for _, text in data:
        yield tokenizer(text)

tokenized_iter = yield_tokens(train_data)
embeddings = Vectors("inputs/glove_6B_50_sample_train.txt")

# ========================================
# TRAINING AND TESTING - GENERAL

def train(model, dataloader, optimizer, epochs=100, print_every=1,
          validation_data=None):
    """Train a PyTorch model and print results periodically

    model: torch.nn.Module; the model to be trained
    dataloader: torch.utils.data.DataLoader; the training data
    optimizer: the PyTorch optimizer to use for training
    epochs: int; the number of complete cycles through the training data
    print_every: int; print the results after this many epochs
        (does not print if this is None)
    validation_data: torch.utils.data.DataLoader; the validation data
    """
    start_time = time.time()

    if print_every is not None:
        # Print initial performance
        initial_performance = test(model, dataloader)
        log_message = '| epoch 0 | train acc {acc:6.3f} | train loss {loss:6.3f} |'.format(**initial_performance)
        if validation_data is not None:
            validation_performance = test(model, validation_data)
            log_message += ' valid acc {acc:6.3f} | valid loss {loss:6.3f} |'.format(**validation_performance)
        print(log_message)

        # Set up trackers for printing results along the way
        total_acc = 0
        total_count = 0
        current_loss = 0.0
        minibatches_per_log = len(dataloader) * print_every

    # Tell the model that these inputs will be used for training
    model.train()

    for epoch in range(epochs):
        # Within each epoch, iterate over the data in mini-batches
        # Note the use of *datapoint_list for generality, whether or not there are offsets
        for (label_list, *datapoint_list) in dataloader:

            # Clear out gradients accumulated from inputs in the previous mini-batch
            model.zero_grad()

            # Run the forward pass to make predictions for the mini-batch
            predicted_probs = model(*datapoint_list).view(-1)

            # Compute the loss and send it backward through the network to get gradients
            # Note: PyTorch averages the loss over all datapoints in the minibatch
            loss = model.loss_function(predicted_probs, label_list.to(torch.float32))
            loss.backward()

            # Nudge the weights
            optimizer.step()

            # Track performance
            if print_every is not None:
                total_acc += ((predicted_probs > 0.5).to(torch.int64) == label_list).sum().item()
                total_count += label_list.size(0)
                current_loss += loss.item()

        # Log performance
        if print_every is not None and (epoch + 1) % print_every == 0:
            log_message = ('| epoch {:3d} | train acc {:6.3f} | train loss {:6.3f} |'
                           .format(epoch + 1, total_acc/total_count, current_loss/minibatches_per_log))
            if validation_data is not None:
                validation_performance = test(model, validation_data)
                log_message += ' valid acc {acc:6.3f} | valid loss {loss:6.3f} |'.format(**validation_performance)
            print(log_message)

            # Reset trackers after logging
            total_acc = 0
            total_count = 0
            current_loss = 0.0
            model.train()

    print("\nOverall training time: {:.0f} seconds".format(time.time() - start_time))

def test(model, dataloader):
    """Evaluate a PyTorch model by testing it on labeled data

    model: torch.nn.Module; the model to be tested
    dataloader: torch.utils.data.DataLoader; the test data
    """
    # Tell the model that these inputs will be used for evaluation
    model.eval()

    # Set up trackers
    total_acc = 0
    total_count = 0
    loss = 0.0

    with torch.no_grad():  # This can speed things up by telling PyTorch to ignore gradients
        # Note the use of *datapoint_list for generality, whether or not there are offsets
        for (label_list, *datapoint_list) in dataloader:
            # Get the model's output predictions
            predicted_probs = model(*datapoint_list).view(-1)
            predicted_labels = (predicted_probs > 0.5).to(torch.int64)

            # Calculate the loss and accuracy
            loss += model.loss_function(predicted_probs, label_list.to(torch.float32)).item()
            total_acc += (predicted_labels == label_list).sum().item()
            total_count += label_list.size(0)

    performance = {"acc": total_acc/total_count, "loss": loss/len(dataloader)}
    return performance

# ==================================
# INSPECTING A MODEL - GENERAL

def display_weights(model):
    """Prints the weights of a model"""
    for name, param in model.named_parameters():
        print(name.upper(), param)

def predict_multiple(model, texts, collate_batch_fn, labels=["neg", "pos"]):
    """Prints a model's predictions for a list of input texts.

    model: torch.nn.Module; a PyTorch RNN model
    texts: list(str); a list of untokenized strings to feed as input to the model
    collate_batch_fn: function; a function that is used to prepare (batched) data
        to be input into the model
    labels: list(str); a list of the labels that correspond to the indices the
        model will output
    """
    # Tell the model not to use these inputs for training
    model.eval()

    # Convert the input texts to indices, and get other model arguments needed
    data = [(None, text) for text in texts]
    (_, *model_input) = collate_batch_fn(data)

    # Feed the inputs through the model
    with torch.no_grad():
        probs = model(*model_input).view(-1)

    # Collate the predictions in a DataFrame
    predictions = pd.DataFrame({"Input text": texts, "Classifier probability": probs})
    predictions["Output label"] = labels[0]
    predictions.loc[predictions["Classifier probability"] > 0.5, "Output label"] = "pos"
    return predictions

# =================================
# LOADING DATA - SPECIFIC TO BOE

def collate_batch_boe(batch):
    """Converts a batch of data into PyTorch tensor format, and collates
    the results by label, text, and offset, for use in a bag-of-embeddings model
    """
    # Initialize lists that separate out the three components
    label_list = list()
    text_list = list()
    offsets_list = [0]

    for data in batch:
        # Convert to PyTorch format
        (label_index, text_indices) = data_to_indices(data)
        # Add converted data to separate component lists
        label_list.append(label_index)
        text_list.append(text_indices)
        offsets_list.append(text_indices.size(0))

    # Convert everything to tensors
    label_tensor = torch.tensor(label_list, dtype=torch.int64)
    text_tensor = torch.cat(text_list)
    offsets_tensor = torch.tensor(offsets_list[:-1]).cumsum(dim=0)

    return (label_tensor, text_tensor, offsets_tensor)

train_dataloader_boe = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch_boe)
test_dataloader_boe = DataLoader(test_data, batch_size=1000, collate_fn=collate_batch_boe)

Activity 1.1: Comparing recurrent and bag-of-embeddings classifiers¶
In this activity, we’ll construct sentiment analysis models for reviews in two ways: using a bag-of-embeddings approach, and using a recurrent neural network.

The model that we’ll use for a bag-of-embeddings approach is a simpler version of the BagOfEmbeddingsBinaryClassifier class from Assignment 1. This model forms a document embedding from the average of the embeddings for all the words in the document, then uses that document embedding in a feedforward neural network with a single hidden layer in order to predict whether the document has positive or negative sentiment.

The code cell below defines a class that can be used as a factory for the BoE model.

class BagOfEmbeddingsBinaryClassifier(nn.Module):

    def __init__(self, vocab, embeddings, hidden_dim, freeze_embeddings=True):
        super(BagOfEmbeddingsBinaryClassifier, self).__init__()

        self.vocab = vocab

        vocab_embeddings = embeddings.get_vecs_by_tokens(self.vocab.get_itos())
        self.embedding = nn.EmbeddingBag.from_pretrained(vocab_embeddings, freeze=freeze_embeddings, mode="mean")

        # The hidden layer will go from the embeddings to a layer of hidden_dim units
        self.hidden_layer = nn.Sequential(
            nn.Linear(embeddings.dim, hidden_dim),
            nn.ReLU()  # nonlinearity for the hidden layer
        )

        # The output layer will go from the hidden layer (hidden_dim units) to a single unit
        self.output_layer = nn.Sequential(
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

        self.loss_function = nn.BCELoss()

    def forward(self, text, offsets):
        # The hidden layer slots in between the embeddings and the outputs
        doc_embedding = self.embedding(text, offsets)
        hidden = self.hidden_layer(doc_embedding)
        output = self.output_layer(hidden)
        return output

For the RNN, we’ll use a unidirectional (forward) model that is as similar to the bag-of-embeddings model as possible. That means the model will take the sequence of word embeddings in the document as input, combine them into a document embedding using recurrence, and then use that document embedding in a feedforward neural network with a single hidden layer to make the prediction. We’ll have all the layers be the same size as they are in the BoE model; that means the recurrent layer will be the same size as the input embeddings (so that the resultant document embedding is also the same size as the word embeddings, like it is in BoE), and the hidden layer in the classifier will be the same size as it is in BoE.

The code cell below defines a class that can be used as a factory for the RNN model, using either “tanh” or “relu” as the recurrent_activation function. We will only use tanh (the default) in this assignment.

class RNNClassifier(nn.Module):

    def __init__(self, vocab, embeddings, recurrent_dim, hidden_dim, freeze_embeddings=True,
                 recurrent_activation="tanh", recurrent_layers=1, recurrent_bidirectional=False):
        super(RNNClassifier, self).__init__()

        self.vocab = vocab

        vocab_embeddings = embeddings.get_vecs_by_tokens(self.vocab.get_itos())
        padding_idx = self.vocab.get_stoi().get("<pad>")  # Get the index of <pad>
        self.embedding = nn.Embedding.from_pretrained(vocab_embeddings, freeze=freeze_embeddings,
                                                      padding_idx=padding_idx)  # Tell PyTorch that <pad> is for padding

        # The embeddings go into an RNN layer with recurrent_dim units
        self.recurrent_layer = nn.RNN(embeddings.dim, recurrent_dim, nonlinearity=recurrent_activation,
                                      num_layers=recurrent_layers, bidirectional=recurrent_bidirectional,
                                      batch_first=True)  # Because we'll make the mini-batch a list of sequences

        # The recurrent output creates a doc_embedding, which feeds into a hidden layer of hidden_dim units
        # We'll be concatenating the forward and backward direction of all layers
        # from the recurrent output, so the doc_embedding will be sized accordingly
        doc_embedding_dim = recurrent_dim * recurrent_layers * int(1 + recurrent_bidirectional)
        self.hidden_layer = nn.Sequential(
            nn.Linear(doc_embedding_dim, hidden_dim),
            nn.ReLU()  # nonlinearity for the hidden layer
        )

        # The output layer will go from the hidden layer (hidden_dim units) to a single unit
        self.output_layer = nn.Sequential(
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

        self.loss_function = nn.BCELoss()

    def forward(self, padded_text, seq_lengths):
        word_embeddings = self.embedding(padded_text)

        # The sequence of word embeddings has to be packed for efficiency of the RNN
        packed_word_embeddings = pack_padded_sequence(word_embeddings, seq_lengths, batch_first=True, enforce_sorted=False)
        (final_layer_all_timesteps, all_layers_final_timestep) = self.recurrent_layer(packed_word_embeddings)

        # all_layers_final_timestep contains the activations of all (stacked / bidirectional) recurrent
        # layers at the final timestep for each sequence (taking the padding into account).
        # For our classifier, we will stick all of these layers together (forward + backward,
        # for each stacked layer) to use as the document embedding.
        # all_layers_final_timestep has shape (num_layers, minibatch_size, recurrent_dim);
        # we want something of shape (minibatch_size, num_layers * recurrent_dim),
        # so we reorder the dimensions and then reshape to stick everything together
        minibatch_size = all_layers_final_timestep.size(1)
        doc_embedding = all_layers_final_timestep.permute(1, 0, 2).reshape(minibatch_size, -1)

        hidden = self.hidden_layer(doc_embedding)
        output = self.output_layer(hidden)
        return output

All that differs between our two models is how they make the document embedding from the word embeddings: BoE uses (mean) pooling, while RNN uses recurrence. The BoE model can therefore be thought of as a special case of the RNN model, where there is no activation function at the recurrent layer, the input weights are fixed to $I/n$ (where $I$ is the identity matrix; this divides each word embedding by the number of words in the document, then adds it to the pool at the recurrent layer), and the recurrent weights are fixed to $I$ (where $I$ is the identity matrix; this carries forward all of the information pooled from the previous timesteps).
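
If you'd like to convince yourself of this equivalence, the short, optional sketch below (our own toy example with made-up sizes, relying on the torch import from the setup cell) runs the recurrence with no activation, input weights $I/n$, and recurrent weights $I$, and checks that it reproduces mean pooling:

# Optional sketch: the linear recurrence h_t = (1/n) x_t + h_{t-1} reproduces mean pooling
torch.manual_seed(0)
n, dim = 4, 3                                # a toy "document" of 4 words with 3-dimensional embeddings
word_embs = torch.randn(n, dim)

boe_doc_embedding = word_embs.mean(dim=0)    # mean pooling, as in the BoE model

h = torch.zeros(dim)                         # h_0 = 0
for x_t in word_embs:                        # one timestep per word
    h = x_t / n + h                          # input weights I/n, recurrent weights I, no activation

print(torch.allclose(boe_doc_embedding, h))  # True: the two document embeddings match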

MINI-BATCHING IN AN RNN

Recall that we usually provide training samples to the model in mini-batches, for efficiency. All of the training samples in a mini-batch are processed together. For feedforward NNs, the input in each training sample was a vector of the same size; but for RNNs, the input in each training sample is a sequence of vectors, and different sequences may have different lengths. In order to form mini-batches appropriately, we have to ensure that all of the sequences in a mini-batch are the same length.

Of course, we probably don’t want to actually change the length of the input sequences in our training data, by cutting long sequences short. So instead, we pad short sequences to make them longer. PyTorch knows that this padding is only there for technical reasons, and shouldn’t count when making predictions or doing backprop.

To be able to pad the sequences, we first have to add a special symbol "<pad>" to our vocabulary (like "<unk>"). The best time to do this is when loading the data, by including "<pad>" in the list of specials given as a keyword argument to the build_vocab_from_iterator() function. The specials are represented by the first few indices in the vocab; for example, if specials=["<unk>", "<pad>"], then "<unk>" will be represented by the index 0 and "<pad>" will be represented by the index 1:

vocab = build_vocab_from_iterator(tokenized_iter, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

padding_idx = vocab.get_stoi().get("<pad>")
print("The <pad> symbol in this vocabulary is represented by the index {}".format(padding_idx))

We also have to tell the input (self.embedding) layer what the padding index is, which is accomplished through the padding_idx keyword argument of the nn.Embedding.from_pretrained() method in the RNNClassifier class above.

Once we have a pad token in the vocab, we need to use it to pad out the mini-batch of sequences, and we need to tell PyTorch to ignore this padding when making predictions or doing backprop (which is accomplished by packing the sequence). PyTorch provides two functions to help with this: pad_sequence() and pack_padded_sequence() (both in torch.nn.utils.rnn). We use pad_sequence() in the custom collate_batch_rnn() function that we use to get mini-batches from a DataLoader, and pack_padded_sequence() in the forward() method of the RNNClassifier (because it is the embeddings that need to be packed, and they are looked up inside forward()).

def collate_batch_rnn(batch):
    """Converts a batch of sequence data into padded and packed PyTorch
    tensor format, and collates the results by label, text, and sequence
    length, for use in an RNN model.
    """
    # Initialize lists that separate out the three components
    label_list = list()
    text_list = list()
    seq_lengths = list()

    for data in batch:
        # Convert to PyTorch format
        (label_index, text_indices) = data_to_indices(data)
        # Add converted data to separate component lists
        label_list.append(label_index)
        text_list.append(text_indices)
        seq_lengths.append(len(text_indices))

    # Convert to mini-batch tensors
    label_tensor = torch.tensor(label_list, dtype=torch.int64)
    text_tensor = pad_sequence(text_list, batch_first=True, padding_value=padding_idx)

    return (label_tensor, text_tensor, seq_lengths)

train_dataloader_rnn = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch_rnn)
test_dataloader_rnn = DataLoader(test_data, batch_size=1000, collate_fn=collate_batch_rnn)

Note that padding complicates matters somewhat: we want the classifier to use the recurrent layer activations from the last non-padding timestep of each sequence, but those timesteps may all be in different positions! This is the reason for the extra steps taken to get doc_embedding inside RNNClassifier.forward().
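
As an optional aside, the standalone sketch below (with made-up sizes, not part of the assignment) shows just the permute/reshape step from RNNClassifier.forward(), which stitches the per-layer final hidden states into one document-embedding row per sequence:

# Optional sketch of the reshape in RNNClassifier.forward(), with made-up sizes:
# 2 stacked bidirectional layers (2 layers * 2 directions = 4), a mini-batch of 3
# sequences, and recurrent_dim = 5
all_layers_final_timestep = torch.randn(4, 3, 5)   # (layers * directions, minibatch_size, recurrent_dim)
minibatch_size = all_layers_final_timestep.size(1)
doc_embedding = all_layers_final_timestep.permute(1, 0, 2).reshape(minibatch_size, -1)
print(doc_embedding.shape)                         # torch.Size([3, 20]): one 20-dimensional vector per sequence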

Questions 1.1.1 – 1.1.2¶
The code cell below shows an example of padding. Run it, then answer the following question:

# Demonstration of padding
texts = ["I hated it", "it was quite terrible", "i really REALLY loved it"]
print("texts:\n{}\n".format(texts))

text_indices = [text_to_indices(text) for text in texts]
print("Converted to indices:\n{}\n".format(text_indices))

padded = pad_sequence(text_indices, batch_first=True, padding_value=padding_idx)
print("Padded:\n{}".format(padded))

QUESTION 1.1.1. [1 point]

How do the padded sequences relate to the original sequences? What are the 0s in the padded tensor, why are they where they are, and why are there as many as there are?

Double-click on this cell to enter your response.

The code cell below demonstrates the importance of telling PyTorch to ignore padding, through use of packing. It defines a simple RNN in which the recurrent layer activation at each timestep increases by $1 + x_t$, where $x_t$ is the input value at timestep $t$. Thus, the recurrent layer activation at timestep $t$ is equal to $t$ plus the sum of the input values at all timesteps up to and including the current timestep.

Run the code cell, then answer the following question:

with torch.no_grad():  # This tells PyTorch we aren't going to be doing backprop

    # Define simple embedding and recurrent layers.
    # The embedding layer just contains the index of the input
    # The recurrent layer has a bias of 1 and all weights fixed to 1,
    # so it adds together all of the inputs up to the current timestep,
    # plus the number of timesteps since the beginning of the sequence
    embedding_layer = nn.Embedding.from_pretrained(torch.arange(5000, dtype=torch.float32).view(-1, 1), 1, padding_idx=padding_idx)
    recurrent_layer = nn.RNN(1, 1, batch_first=True, nonlinearity="relu")
    recurrent_layer.bias_ih_l0[0] = 0.0
    recurrent_layer.bias_hh_l0[0] = 1.0
    recurrent_layer.weight_ih_l0[0, 0] = 1.0
    recurrent_layer.weight_hh_l0[0, 0] = 1.0

    # Get the word embeddings for the padded sequence
    word_embeddings = embedding_layer(padded)

    # With packing: pack the word embeddings,
    # run the recurrent layer on the packed padded embeddings,
    # and then unpack the results
    packed_word_embeddings = pack_padded_sequence(word_embeddings, [3, 4, 5], batch_first=True, enforce_sorted=False)
    (final_layer_all_timesteps, all_layers_final_timestep) = recurrent_layer(packed_word_embeddings)
    final_layer_all_timesteps = pad_packed_sequence(final_layer_all_timesteps, batch_first=True, padding_value=padding_idx)

    print("===========================")
    print("WITH PACKING")
    print("---------------------------")
    print("Recurrent layer activation at all timesteps, for each sequence:")
    print(final_layer_all_timesteps[0].view(3, -1))
    print("\nWhat PyTorch gets as the recurrent layer activation at the \"final\" timestep for each sequence:")
    for (seqnum, value) in enumerate(list(all_layers_final_timestep.view(-1))):
        print("Sequence {}: {}".format(seqnum + 1, int(value)))

    # Without packing: run the recurrent layer on the padded embeddings directly
    (final_layer_all_timesteps, all_layers_final_timestep) = recurrent_layer(word_embeddings)

    print("\n===========================")
    print("WITHOUT PACKING")
    print("---------------------------")
    print("Recurrent layer activation at all timesteps, for each sequence:")
    print(final_layer_all_timesteps.view(3, -1))
    print("\nWhat PyTorch gets as the recurrent layer activation at the \"final\" timestep for each sequence:")
    for (seqnum, value) in enumerate(list(all_layers_final_timestep.view(-1))):
        print("Sequence {}: {}".format(seqnum + 1, int(value)))

QUESTION 1.1.2. [2 points]

Which timestep in a sequence is treated as “final” with packing, and which timestep is treated as “final” without packing? What has to be true about a sequence in order for these two definitions of the “final” timestep to be the same? In sequences where they are not the same, how/why does using padding without packing make the value of the recurrent layer activation at the “final” timestep misleading?

Double-click on this cell to enter your response.

Now that we have padded our data, we can use it to train the RNN. Training looks exactly the same as it does for other NN models we have seen before:

set the seed, in order to ensure results are reproducible;
create the model as an instance of the relevant class;
define an optimizer for the model, which will be used to nudge the weights during training;
train the model for a specified number of epochs.

Since calculating the gradients in RNNs involves many steps, the SGD optimizer that we used in the past is typically too simple to enable good performance. We will still use it here, but we will explore alternatives later in this notebook.
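
For reference, swapping the optimizer later will only require changing the line that creates it; the sketch below uses a throwaway toy model (not anything from this assignment) just to show the shape of that change.

# Sketch only: the optimizer is a drop-in choice, so moving from plain SGD to an
# adaptive method such as Adam changes a single line; the toy model and learning
# rates here are placeholders, not values used in this assignment
toy_model = nn.Linear(4, 1)
sgd_optimizer = optim.SGD(toy_model.parameters(), lr=0.01)
adam_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)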

Questions 1.1.3 – 1.1.10¶
The code cell below trains an RNN classifier for sentiment analysis on the movie reviews dataset we used in Assignment 1, for which the number of units in the recurrent layer has been set to the size of an embedding (the third argument, embeddings.dim). Run the code cell to train the model, and then answer the following questions.

Note: it may take a few minutes to train the model

torch.manual_seed(1)
rnn_classifier = RNNClassifier(vocab, embeddings, embeddings.dim, 20)
optimizer = optim.SGD(rnn_classifier.parameters(), lr=0.01)
train(rnn_classifier, train_dataloader_rnn, optimizer, epochs=100, print_every=5, validation_data=test_dataloader_rnn)

QUESTION 1.1.3. [2 points]

Focus first on the training accuracy and training loss, presented in the second and third columns of the table above. How do these numbers change as training epochs pass? What does that mean?

Double-click on this cell to enter your response.

QUESTION 1.1.4. [2 points]

Now focus on the validation accuracy and validation loss, presented in the fourth and fifth columns. How do these numbers change as training epochs pass? How does that compare to what happens with the training accuracy and training loss? What does that suggest?

Double-click on this cell to enter your response.

To put the RNN model in context, it is helpful to compare it against a bag-of-embeddings classifier like the one we saw in Assignment 1. The following code cell creates and trains such a model. Run it, then answer the following questions.

torch.manual_seed(1)
boe_classifier = BagOfEmbeddingsBinaryClassifier(vocab, embeddings, 20)
optimizer = optim.SGD(boe_classifier.parameters(), lr=0.01)
train(boe_classifier, train_dataloader_boe, optimizer, epochs=100, print_every=5, validation_data=test_dataloader_boe)

QUESTION 1.1.5. [2 points]

How does performance on the training set differ between the RNN and BoE models? What does that mean for the potential to capture complex patterns in the RNN model as compared to the BoE model, and where does that potential come from?

Double-click on this cell to enter your response.

QUESTION 1.1.6. [2 points]

How does perform