COMP 4446 5046 Lab02
PyTorch is an open source machine learning library used for applications such as natural language processing and computer vision. It is based on the Torch library.
Before we use PyTorch, it is necessary to understand what PyTorch is. Let's start with the core concepts: Tensor, (Computational) Graph, and Automatic Differentiation.
A tensor is a generalization of vectors and matrices. Vectors are one dimensional. Matrices are two dimensional. Tensors can have any number of dimensions. Tensors are the primary data structure used by neural networks. Normally, we refer to nd-tensors where nd stands for n dimensional.
There are three basic attributes we need to know about tensors:
Rank: The number of dimensions present within the tensor. e.g. rank-2 tensor means 2d-tensor.
Axes: Used to refer to a specific dimension. The number of axes is equal to the number of dimensions. The length of an axis represents the number of elements on that axis.
Shape: Formed by the length of each axis. e.g. shape(1,2) means a 2d-tensor with the first axis of length 1 and the second axis of length 2.
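As a quick illustration of rank, axes and shape (a minimal sketch; torch.ones is introduced later in this lab):
import torch

t = torch.ones(1, 2)   # a rank-2 (2d) tensor with shape (1, 2)
print(t.dim())         # rank, i.e. the number of dimensions: 2
print(t.shape)         # shape: torch.Size([1, 2])
print(t.shape[1])      # length of the second axis: 2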
In PyTorch, tensors are represented by an object called torch.Tensor. This object has the following key properties:
torch.dtype: an object representing the data type of a torch.Tensor. e.g. torch.float32
torch.device: an object representing the device on which a torch.Tensor is or will be allocated. e.g. CPU or CUDA (GPU)
torch.layout: an object representing the memory layout of a torch.Tensor.
More details with illustrative examples can be found here
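For example, we can inspect these properties directly (a small sketch; the printed values assume a tensor created on the CPU with default settings):
import torch

t = torch.tensor([1.0, 2.0])
print(t.dtype)                   # torch.float32
print(t.device)                  # cpu (would be e.g. cuda:0 on a GPU)
print(t.layout)                  # torch.strided, the default dense layout
print(torch.cuda.is_available()) # True if a CUDA GPU can be used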
Computational Graph and Automatic Differentiation¶
PyTorch uses (directed acyclic) computational graphs to record the operations that are applied to tensors inside neural networks. In these graphs, the nodes are tensors while the edges are functions that produce output tensors from input tensors (e.g. summation, multiplication). These graphs enable PyTorch to perform automatic differentiation for us, i.e. it can automatically calculate the derivatives that are needed for network optimization. We will learn more about this through practical examples in the following sections.
Specifically, PyTorch generates the computational graph on the fly as operations are performed during the forward pass of a neural network, which is referred to as a dynamic computational graph. Historically, this was one of the main differences between PyTorch and TensorFlow (which used static computational graphs, but now has an eager mode that functions more like PyTorch).
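To see the dynamic graph being built, note that each tensor produced by an operation on a gradient-tracking tensor records the function that created it in its grad_fn attribute (a minimal sketch; the exact grad_fn names may differ across PyTorch versions):
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * 3   # this node is added to the graph as the operation runs
z = y + 1   # and so is this one
print(y.grad_fn, z.grad_fn)   # e.g. <MulBackward0 ...> <AddBackward0 ...>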
Importing the PyTorch library¶
Google Colab has the torch library installed by default so you just need to import it as shown below:
import torch
print(torch.__version__) #check version
1.13.1+cu116
Tensor creation¶
We will be implementing lots of models in PyTorch during the course. To get started, let’s have a look at how to create a tensor.
We can create tensors with numerical data, typically numpy arrays:
import numpy as np
# Scalar (0 Rank)
data = torch.tensor(1)
print(data.shape)
# Vector (1 Rank)
data = np.array([1,2])
data = torch.Tensor(data)
print(data.shape)
# Matrix (2 Rank)
data = np.ones((2,2,))
data = torch.Tensor(data)
print(data.shape)
# Cube (3 Rank)
data = np.ones((2,2,2))
data = torch.Tensor(data)
print(data.shape)
# Vector of cubes (4 Rank)
data = np.ones((2,2,2,2))
data = torch.Tensor(data)
print(data.shape)
torch.Size([])
torch.Size([2])
torch.Size([2, 2])
torch.Size([2, 2, 2])
torch.Size([2, 2, 2, 2])
Notice that although both torch.Tensor() and torch.tensor() can be used to generate tensors, there are some differences:
torch.Tensor() is the constructor of the torch.Tensor class, while torch.tensor() is a factory function that constructs torch.Tensor objects and returns them to the caller
torch.Tensor() can return an empty tensor without any input data, while torch.tensor() with no input data will produce a TypeError
torch.Tensor() uses the default dtype (float32), while torch.tensor() chooses the dtype based on the incoming data (type inference). This is illustrated by the following example:
data = np.array([1,2])
print("NumPy Data type:", data.dtype)
data_T = torch.Tensor(data)
print("torch.Tensor():")
print("Object:", data_T)
print("Data type:", data_T.dtype)
data_t = torch.tensor(data)
print("torch.tensor():")
print("Object:", data_t)
print("Data type", data_t.dtype)
# we can also specify a datatype with torch.tensor()
data_t = torch.tensor(data, dtype=torch.float64)
print("torch.tensor() with specified type:")
print("Object:", data_t)
print("Data type", data_t.dtype)
# We can also check and set the default type
print("Default type?", torch.get_default_dtype())
# Note, the default can be set with 'torch.set_default_dtype(dtype)'
NumPy Data type: int64
torch.Tensor():
Object: tensor([1., 2.])
Data type: torch.float32
torch.tensor():
Object: tensor([1, 2])
Data type torch.int64
torch.tensor() with specified type:
Object: tensor([1., 2.], dtype=torch.float64)
Data type torch.float64
Default type? torch.float32
As well as torch.Tensor() and torch.tensor(), we can also use torch.as_tensor() and torch.from_numpy().
# Create tensor using torch.as_tensor
np_data = np.ones((2,2,2))
data = torch.as_tensor(np_data)
print("Data:", np_data)
print("torch.as_tensor:")
print("Object:", data)
print("Data type:", data.dtype)
print("Data shape:", data.shape)
# Create tensor using torch.from_numpy
np_data = np.ones((2,2,2,2))
data = torch.from_numpy(np_data)
print("Data:", np_data)
print("torch.from_numpy:")
print("Object:", data)
print("Data type:", data.dtype)
print("Data shape:", data.shape)
Data: [[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]
torch.as_tensor:
Object: tensor([[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]]], dtype=torch.float64)
Data type: torch.float64
Data shape: torch.Size([2, 2, 2])
Data: [[[[1. 1.]
   [1. 1.]]

  [[1. 1.]
   [1. 1.]]]


 [[[1. 1.]
   [1. 1.]]

  [[1. 1.]
   [1. 1.]]]]
torch.from_numpy:
Object: tensor([[[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]]],
[[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]]]], dtype=torch.float64)
Data type: torch.float64
Data shape: torch.Size([2, 2, 2, 2])
Alternatively, we can create tensors without input data using factory functions:
# torch.eye: Returns an identity matrix of the specified size
torch.eye(2)
tensor([[1., 0.],
[0., 1.]])
# torch.zeros: Returns a tensor of given shape filled with all zeros
torch.zeros(2,2)
tensor([[0., 0.],
[0., 0.]])
# torch.ones: Returns a tensor of given shape filled with all ones
torch.ones(2,2)
tensor([[1., 1.],
[1., 1.]])
# torch.rand: Returns a tensor of given shape filled with values drawn from a uniform distribution on [0, 1).
torch.rand(2,2)
tensor([[0.3702, 0.7207],
[0.4302, 0.5925]])
More factory functions for tensor creation can be found here
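A few other commonly used factory functions (a small illustrative sample, not an exhaustive list):
import torch

print(torch.arange(0, 6, 2))    # tensor([0, 2, 4])
print(torch.linspace(0, 1, 5))  # 5 evenly spaced values between 0 and 1
print(torch.full((2, 2), 7.0))  # a 2x2 tensor filled with 7.0
print(torch.randn(2, 2))        # values drawn from a standard normal distribution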
Basic tensor operations¶
The list of operations with examples can be found here. Please go through the list and practise with the examples yourself before you move on. You don't need to remember all of them; you can easily refer back to the list when needed.
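As a warm-up, here is a small sketch of a few operations you will use repeatedly in this course (reshaping, reduction, indexing and matrix multiplication):
import torch

a = torch.arange(6)             # tensor([0, 1, 2, 3, 4, 5])
b = a.reshape(2, 3)             # view the same data as a (2, 3) matrix
print(b.sum())                  # sum of all elements: tensor(15)
print(b.sum(dim=0))             # sum along the first axis: tensor([3, 5, 7])
print(b[0, 1])                  # indexing: tensor(1)
print(b.float() @ b.float().T)  # matrix multiplication, giving a (2, 2) result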
Simple Linear Regression¶
The following code implements a simple linear regression algorithm.
Linear Regression from scratch¶
Prepare data
import numpy
import matplotlib.pyplot as plt
import torch
# training data
x_training = numpy.asarray([1,2,5,8,9,12,14,16,18,20])
y_training = numpy.asarray([1500,3500,7200,11000,12500,18500,22000,24500,28000,30500])
x_test = numpy.asarray([3,7,13,15,19])
y_test = numpy.asarray([4400,10000,19500,23500,29000])
# creating tensors for training data and test data
x_data = torch.from_numpy(x_training)
y_data = torch.from_numpy(y_training)
x_test_data = torch.from_numpy(x_test)
y_test_data = torch.from_numpy(y_test)
Once the dataset is prepared, we can start defining our model architecture
Let’s first build it from scratch to gain a clear understanding of how automatic differentiation works:
# Define weights and biases
weight = torch.tensor(numpy.random.randn(), requires_grad=True)
bias = torch.tensor(numpy.random.randn(), requires_grad=True)
print(weight)
print(bias)
tensor(-1.3027, requires_grad=True)
tensor(0.0370, requires_grad=True)
Note that we set 'requires_grad=True' above, which turns on automatic gradient computation for these two objects.
Every Tensor has a flag, 'requires_grad', that allows for fine-grained exclusion of subgraphs from gradient computation and can increase efficiency. If any input to an operation requires a gradient, its output will also require a gradient. Conversely, the output won't require a gradient only if all inputs don't require gradients. Backward computation is never performed in subgraphs where no Tensor requires a gradient.
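A small sketch of this propagation rule (the tensors here are just for illustration):
import torch

x = torch.ones(2, requires_grad=True)  # gradients are tracked for x
y = torch.ones(2)                      # requires_grad defaults to False
z = (x * y).sum()                      # one input requires a gradient...
print(z.requires_grad)                 # ...so the output does too: True
w = (y * 2).sum()                      # no input requires a gradient
print(w.requires_grad)                 # False: this subgraph is skipped in backward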
# Define the model
# Hypothesis = W * X + b (Linear Model)
def linearRegression(x):
    return x * weight + bias
# Generate predictions and compare with ground truth labels
# As we can see, because we randomly initialised the weight and bias, the model is not accurate
predictions = linearRegression(x_data)
print(predictions)
print(y_data)
tensor([ -1.2658, -2.5685, -6.4767, -10.3849, -11.6876, -15.5958, -18.2013,
-20.8067, -23.4122, -26.0177], grad_fn=
tensor([ 1500, 3500, 7200, 11000, 12500, 18500, 22000, 24500, 28000, 30500])
# Define loss function
# here we use mean squared error (MSE)
def mse(x1, x2):
    diff = x1 - x2
    return torch.sum(diff*diff)/diff.numel()
# Compute loss – lower is better
loss = mse(predictions, y_data)
print(loss)
tensor(3.4903e+08, grad_fn=
PyTorch can automatically compute the gradient/derivative of the loss. In this case, the derivative is calculated with respect to the weight and bias, because we set 'requires_grad=True' for them and they are part of the graph that produced the loss.
All we need to do now is call the backward() function on our loss, which will trigger the automatic computation of gradients based on the chain rule.
# Compute gradients
loss.backward()
After the backward pass, the gradients are stored in the .grad property of the tensors involved:
# Gradient for weight
print("Current weight:", weight)
print("Weight gradient:", weight.grad)
# Gradient for bias
print("Current bias:", bias)
print("Bias gradient:", bias.grad)
Current weight: tensor(-1.3027, requires_grad=True)
Weight gradient: tensor(-456588.7812)
Current bias: tensor(0.0370, requires_grad=True)
Bias gradient: tensor(-31867.2852)
Now we can adjust the weight and bias using the gradients. Why would we do that? Because it reduces the error on the data we used to compute the gradient.
# This 'with' block means there will be no automatic gradient computation
# We do this because we do not want gradients to be calculated for this operation
with torch.no_grad():
    # We use a learning rate of 1e-5 here (i.e., that is how big a change we make)
    weight -= weight.grad * 1e-5
    bias -= bias.grad * 1e-5
print(weight)
print(bias)
tensor(3.2632, requires_grad=True)
tensor(0.3556, requires_grad=True)
We need to reset the gradients before the next forward pass, because PyTorch accumulates gradients.
with torch.no_grad():
    weight.grad.zero_()
    bias.grad.zero_()
print(weight.grad)
print(bias.grad)
tensor(0.)
tensor(0.)
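To see why resetting matters, here is a small sketch of gradient accumulation (separate from the regression example above):
import torch

w = torch.tensor(1.0, requires_grad=True)
(w * 3).backward()
print(w.grad)       # tensor(3.)
(w * 3).backward()  # without zeroing, the new gradient is added to the old one
print(w.grad)       # tensor(6.)
w.grad.zero_()
print(w.grad)       # tensor(0.)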
Let’s predict and compute loss again. The loss should be lower with the new weights and biases.
predictions = linearRegression(x_data)
loss = mse(predictions, y_data)
print(loss)
tensor(3.4694e+08, grad_fn=
Hopefully you now have an initial understanding of how automatic gradient computation works.
Let’s start training the model for multiple epochs (each epoch is one pass through the data).
We do this with a python loop.
# An epoch is one iteration over the entire input data
no_of_epochs = 5000
# How often you want to display training info.
display_interval = 200
for epoch in range(no_of_epochs):
    predictions = linearRegression(x_data)
    loss = mse(predictions, y_data)
    loss.backward()
    with torch.no_grad():
        weight -= weight.grad * 1e-5
        bias -= bias.grad * 1e-5
        weight.grad.zero_()
        bias.grad.zero_()
    if epoch % display_interval == 0:
        # calculate the cost of the current model
        predictions = linearRegression(x_data)
        loss = mse(predictions, y_data)
        print("Epoch:", '%04d' % (epoch), "loss=", "{:.8f}".format(loss), "W=", "{:.4f}".format(weight), "b=", "{:.4f}".format(bias))

print("=========================================================")
training_loss = mse(linearRegression(x_data), y_data)
print("Optimised:", "loss=", "{:.9f}".format(training_loss.data), \
      "W=", "{:.9f}".format(weight.data), "b=", "{:.9f}".format(bias.data))
# Plot training data on a graph
plt.plot(x_training, y_training, 'ro', label='Training data')
plt.plot(x_training, weight.data * x_training + bias.data, label='Linear')
plt.legend()
plt.show()
# Calculate test loss
testing_loss = mse(linearRegression(x_test_data), y_test_data)
print("Testing loss=", "{:.9f}".format(testing_loss.data))
print("Absolute mean square loss difference:", "{:.9f}".format(abs(
    training_loss.data - testing_loss.data)))
# Plot test data on a graph
plt.plot(x_test, y_test, 'bo', label='Testing data')
plt.plot(x_test, weight.data * x_test + bias.data, label='Linear')
plt.legend()
plt.show()
Epoch: 0000 loss= 344856448.00000000 W= 7.8153 b= 0.6733
Epoch: 0200 loss= 103786976.00000000 W= 690.8513 b= 48.2161
Epoch: 0400 loss= 31447782.00000000 W= 1065.0287 b= 74.0290
Epoch: 0600 loss= 9740470.00000000 W= 1270.0156 b= 87.9388
Epoch: 0800 loss= 3226474.75000000 W= 1382.3229 b= 95.3283
Epoch: 1000 loss= 1271670.75000000 W= 1443.8608 b= 99.1464
Epoch: 1200 loss= 685010.12500000 W= 1477.5861 b= 101.0082
Epoch: 1400 loss= 508870.93750000 W= 1496.0773 b= 101.7988
Epoch: 1600 loss= 455930.90625000 W= 1506.2218 b= 102.0026
Epoch: 1800 loss= 439952.81250000 W= 1511.7955 b= 101.8854
Epoch: 2000 loss= 435068.15625000 W= 1514.8646 b= 101.5925
Epoch: 2200 loss= 433512.06250000 W= 1516.5621 b= 101.2036
Epoch: 2400 loss= 432955.81250000 W= 1517.5081 b= 100.7623
Epoch: 2600 loss= 432699.06250000 W= 1518.0421 b= 100.2927
Epoch: 2800 loss= 432532.43750000 W= 1518.3507 b= 99.8077
Epoch: 3000 loss= 432393.84375000 W= 1518.5350 b= 99.3145
Epoch: 3200 loss= 432263.56250000 W= 1518.6520 b= 98.8171
Epoch: 3400 loss= 432134.81250000 W= 1518.7321 b= 98.3178
Epoch: 3600 loss= 432008.40625000 W= 1518.7894 b= 97.8173
Epoch: 3800 loss= 431881.56250000 W= 1518.8383 b= 97.3168
Epoch: 4000 loss= 431755.34375000 W= 1518.8871 b= 96.8163
Epoch: 4200 loss= 431629.06250000 W= 1518.9266 b= 96.3158
Epoch: 4400 loss= 431503.43750000 W= 1518.9618 b= 95.8155
Epoch: 4600 loss= 431378.56250000 W= 1518.9968 b= 95.3166
Epoch: 4800 loss= 431253.15625000 W= 1519.0319 b= 94.8176
=========================================================
Optimised: loss= 431128.656250000 W= 1519.066650391 b= 94.321456909
Testing loss= 219183.453125000
Absolute mean square loss difference: 211945.203125000
Linear Regression using PyTorch built-ins¶
Now we will walk through creating a linear model using a built-in PyTorch class rather than defining our own function and tensors.
Prepare some random data
import numpy
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
# training data: this time we use 2d array
# Assuming we have 90 samples of 10 features about a house condition, such as
# bedroom number, distance to city center etc. and will predict the house price
x_data = torch.randn(90, 10)
y_data = torch.randn(90, 1)
# dev data:
x_dev_data = torch.randn(10, 10)
y_dev_data = torch.randn(10, 1)
# test data:
x_test_data = torch.randn(10, 10)
y_test_data = torch.randn(10, 1)
This time we don’t need to initialize the weight and bias manually. Instead, we will define the model using the built-in torch.nn.Linear.
torch.nn is a subpackage that contains modules and extensible classes for us to build neural networks.
# Define model
linearRegression = nn.Linear(10,1)
print(linearRegression.weight)
print(linearRegression.bias)
Parameter containing:
tensor([[ 0.0723, 0.1927, -0.1910, -0.1429, -0.1481, 0.2022, -0.1815, 0.1066,
-0.1767, 0.1163]], requires_grad=True)
Parameter containing:
tensor([0.1193], requires_grad=True)
Similarly, we don't update the weight and bias manually using the gradients. Instead, we use the optimizer optim.SGD.
torch.optim is a subpackage that contains standard optimization algorithms such as Adam and SGD.
# Define optimizer
# Just pass the model parameters to be updated and specify the learning rate when calling optim.SGD
# In PyTorch, 'SGD' implements gradient descent over whatever batch of data you pass in
# (momentum is optional and off by default).
# In this case, we pass the entire dataset at every step,
# so it is effectively Batch Gradient Descent.
optimizer = torch.optim.SGD(linearRegression.parameters(), lr=1e-5)
Again, we use the built-in loss function mse_loss instead of defining it manually.
We will need the torch.nn.functional interface, which contains common operations for building neural networks, e.g., convolution operations, activation functions, and loss functions.
# Import nn.functional
import torch.nn.functional as F
# Define the loss function
loss_func = F.mse_loss
# Calculate loss
loss = loss_func(linearRegression(x_data), y_data)
print(loss)
tensor(1.4413, grad_fn=
Now we are ready to run a loop to train the model:
# An epoch is one iteration over the entire input data
no_of_epochs = 5000
# How often you want to display training info.
display_interval = 200
for epoch in range(no_of_epochs):
    predictions = linearRegression(x_data)
    loss = loss_func(predictions, y_data)
    loss.backward()
    optimizer.step()      # step() updates the parameters through our defined optimizer; call it once after backward()
    optimizer.zero_grad() # reset the gradients, as we did before
    if epoch % display_interval == 0:
        # calculate the loss of the current model
        predictions = linearRegression(x_dev_data)
        loss = loss_func(predictions, y_dev_data)
        print("Epoch:", '%04d' % (epoch), "dev loss=", "{:.8f}".format(loss))

print("=========================================================")
training_loss = mse(linearRegression(x_data), y_data)
print("Optimised:", "training loss=", "{:.9f}".format(training_loss.data))
training_loss = mse(linearRegression(x_dev_data), y_dev_data)
print("Optimised:", "dev loss=", "{:.9f}".format(training_loss.data))
print("=========================================================")

# Calculate testing loss
testing_loss = loss_func(linearRegression(x_test_data), y_test_data)
print("Testing loss=", "{:.9f}".format(testing_loss.data))
print("Absolute mean square loss difference:", "{:.9f}".format(abs(
    training_loss.data - testing_loss.data)))
Epoch: 0000 dev loss= 2.17647552
Epoch: 0200 dev loss= 2.17041731
Epoch: 0400 dev loss= 2.16439939
Epoch: 0600 dev loss= 2.15842271
Epoch: 0800 dev loss= 2.15248728
Epoch: 1000 dev loss= 2.14659238
Epoch: 1200 dev loss= 2.14073801
Epoch: 1400 dev loss= 2.13492393
Epoch: 1600 dev loss= 2.12914991
Epoch: 1800 dev loss= 2.12341547
Epoch: 2000 dev loss= 2.11771917
Epoch: 2200 dev loss= 2.11206126
Epoch: 2400 dev loss= 2.10644293
Epoch: 2600 dev loss= 2.10086274
Epoch: 2800 dev loss= 2.09531927
Epoch: 3000 dev loss= 2.08981371
Epoch: 3200 dev loss= 2.08434677
Epoch: 3400 dev loss= 2.07891512
Epoch: 3600 dev loss= 2.07352018
Epoch: 3800 dev loss= 2.06816339
Epoch: 4000 dev loss= 2.06284094
Epoch: 4200 dev loss= 2.05755424
Epoch: 4400 dev loss= 2.05230498
Epoch: 4600 dev loss= 2.04708910
Epoch: 4800 dev loss= 2.04190922
=========================================================
Optimised: training loss= 1.381885290
Optimised: dev loss= 2.036790371
=========================================================
Testing loss= 0.715529561
Absolute mean square loss difference: 1.321260810
NLTK Library and WordNet¶
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept.
In Python, NLTK (Natural Language Toolkit) includes the English WordNet. NLTK is an excellent educational resource. It has many convenient modules for NLP tasks, and is well documented. However, most of its components are far from state-of-the-art in accuracy and speed.
To use WordNet, you need to download the WordNet data via the NLTK library:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to /root/nltk_data…
[nltk_data] Downloading package omw-1.4 to /root/nltk_data…
from nltk.corpus import wordnet as wn
Let's get some synsets. A synset is a set of synonyms that share a common meaning.
Note that we specify a particular sense/meaning of each word here (the '01').
dog = wn.synset('dog.n.01')
person = wn.synset('person.n.01')
cat = wn.synset('cat.n.01')
computer = wn.synset('computer.n.01')
path_similarity()¶
path_similarity() returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1; specifically, it is 1 / (1 + the length of the shortest path between the two senses).
print("dog<->cat : ", wn.path_similarity(dog,cat))
print("person<->cat : ", wn.path_similarity(person,cat))
print("person<->dog : ", wn.path_similarity(person,dog))
print("person<->computer : ", wn.path_similarity(person,computer))
dog<->cat : 0.2
person<->cat : 0.1
person<->dog : 0.2
person<->computer : 0.1111111111111111
Wu-Palmer Similarity (wup_similarity() )¶
wup_similarity() returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and the depth of their Least Common Subsumer / Lowest Common Ancestor (i.e., the most specific synset that is an ancestor of both senses).
print("dog<->cat : ", wn.wup_similarity(dog,cat))
print("person<->cat : ", wn.wup_similarity(person,cat))
print("person<->dog : ", wn.wup_similarity(person,dog))
print("person<->computer : ", wn.wup_similarity(person,computer))
dog<->cat : 0.8571428571428571
person<->cat : 0.5714285714285714
person<->dog : 0.75
person<->computer : 0.5
TFIDF (Term Frequency Inverse Document Frequency)¶
TFIDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Tokenization
import nltk
import re
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
import numpy as np
nltk.download('stopwords')
from nltk.corpus import stopwords as sw
corpus = [
    'Jonathan loves NLP. NLP teaches Jonathan', # document 1
    'Jonathan teaches NLP'                      # document 2
]

# Tokenize sentences - for only doc1
tokenized_sentence = sent_tokenize(corpus[0])
print("\ntokenized_sentence: ")
print(tokenized_sentence)

# Remove punctuation - for only doc1
punc_free_doc1 = re.sub(r'[^\w\s]', '', corpus[0])
print("\npunc_free_sentence: ")
print(punc_free_doc1)

# Tokenize words - for only doc1
tokenized_doc1 = word_tokenize(punc_free_doc1)
print("\ntokenized_word: ")
print(tokenized_doc1)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokenized_doc1]
print("\nlower_case: ")
print(lower_tokens)

# Stop word removal
sww = sw.words()
tokenized_doc1 = [w for w in lower_tokens if not w in sww]
print("\ntokenized_word (in lower case, w/o stopwords): ")
print(tokenized_doc1)

# Same process for doc2
punc_free_doc2 = re.sub(r'[^\w\s]', '', corpus[1])
tokenized_doc2 = word_tokenize(punc_free_doc2)
lower_tokens2 = [t.lower() for t in tokenized_doc2]
tokenized_doc2 = [w for w in lower_tokens2 if not w in sww]

tokenized_docs = [tokenized_doc1, tokenized_doc2]
print("\nfinal_docs: ")
print(tokenized_docs[0])
print(tokenized_docs[1])
[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Unzipping tokenizers/punkt.zip.
tokenized_sentence:
['Jonathan loves NLP.', 'NLP teaches Jonathan']
punc_free_sentence:
Jonathan loves NLP NLP teaches Jonathan
tokenized_word:
['Jonathan', 'loves', 'NLP', 'NLP', 'teaches', 'Jonathan']
lower_case:
['jonathan', 'loves', 'nlp', 'nlp', 'teaches', 'jonathan']
tokenized_word (in lower case, w/o stopwords):
['jonathan', 'loves', 'nlp', 'nlp', 'teaches', 'jonathan']
final_docs:
['jonathan', 'loves', 'nlp', 'nlp', 'teaches', 'jonathan']
['jonathan', 'teaches', 'nlp']
[nltk_data] Downloading package stopwords to /root/nltk_data…
[nltk_data] Unzipping corpora/stopwords.zip.
Document Frequency (DF)
N is the number of documents in the collection.
DF is the number of documents in the collection that contain a specific word.
df(t) = the number of documents that contain the term t
idf(t) = log(N / (df(t) + 1))
(The TF-IDF code further below adds 1 to this logarithm so that the idf values stay positive.)
DF = {}
for tokenized_doc in tokenized_docs:
    # for each unique word in the doc, count it once towards its document frequency
    for term in np.unique(tokenized_doc):
        if term in DF:
            DF[term] += 1
        else:
            DF[term] = 1
print(DF)
{‘jonathan’: 2, ‘loves’: 1, ‘nlp’: 2, ‘teaches’: 2}
DF['jonathan']
TF-IDF calculation
In the following sample code, we will use Counter to count how often each word occurs in a document. Counter is a Python class from the collections module that counts the elements of an iterable. Here, tf(t, d) is the number of times term t occurs in document d divided by the total number of words in d, and the final score is tf-idf(t, d) = tf(t, d) * idf(t).
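For example (a quick sketch of Counter on a toy list):
from collections import Counter

counter = Counter(['nlp', 'nlp', 'jonathan'])
print(counter)            # Counter({'nlp': 2, 'jonathan': 1})
print(counter['nlp'])     # 2
print(counter['absent'])  # 0: missing elements count as zero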
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import math
tf_idf = {}
# total number of documents
N = len(tokenized_docs)
doc_id = 0
# get each tokenised doc
for tokenized_doc in tokenized_docs:
    # initialise a counter of word occurrences for the doc
    counter = Counter(tokenized_doc)
    # calculate the total number of words in the doc
    total_num_words = len(tokenized_doc)
    # get each unique word in the doc
    for term in np.unique(tokenized_doc):
        # calculate Term Frequency
        tf = counter[term] / total_num_words
        # get the Document Frequency
        df = DF[term]
        # calculate Inverse Document Frequency with smoothing (the +1)
        idf = math.log(N / (df + 1)) + 1
        # calculate TF-IDF
        tf_idf[doc_id, term] = tf * idf
    doc_id += 1

# display the resulting dictionary
tf_idf
{(0, 'jonathan'): 0.19817829729727854,
 (0, 'loves'): 0.16666666666666666,
 (0, 'nlp'): 0.19817829729727854,
 (0, 'teaches'): 0.09908914864863927,
 (1, 'jonathan'): 0.19817829729727854,
 (1, 'nlp'): 0.19817829729727854,
 (1, 'teaches'): 0.19817829729727854}
Sort by importance – Descending Order
Use the Python built-in function sorted to sort the words based on their tf_idf values.
import numpy as np
#sort the dictionary based on values
dict_example = tf_idf
sorted_dict = sorted(dict_example.items(), key=lambda x: x[1], reverse=True)
sorted_dict
[((0, 'jonathan'), 0.19817829729727854),
 ((0, 'nlp'), 0.19817829729727854),
 ((1, 'jonathan'), 0.19817829729727854),
 ((1, 'nlp'), 0.19817829729727854),
 ((1, 'teaches'), 0.19817829729727854),
 ((0, 'loves'), 0.16666666666666666),
 ((0, 'teaches'), 0.09908914864863927)]
Please complete the following two questions E1 and E2. You should submit an “ipynb” file to Canvas (When you have completed your answer here you can download it using “File” > “Download .ipynb”).
E1. What is the main limitation of using a one-hot encoding? (Full Mark: 1 mark. There is no partial mark.)¶
In order to receive full marks, please write down your answer below with a supportive example, using your own words.
#Lab01 - E1
Answer = "In my opinion, the main limitation of one-hot encoding is that it is hard to extract meanings and, further, it loses the inner meaning of the word in a sentence because each word is embedded in isolation. If the vocabulary list contains 'beef', 'pork', and 'lamb', we can see the similarity between those words, such as the fact that they are types of meat. However, with one-hot vectors, all words are an equal distance apart." #@param {type:"raw"}
E2. Calculate TF-IDF and search the Wiki page. (Full Mark: 1 mark. There is no partial mark.)¶
You need to complete the following ‘Exercise Requirement’
In this exercise, we will practise calculating TF-IDF using documents from the wikipedia library, a Python library that makes it easy to access and parse data from Wikipedia. Based on the calculated TF-IDF values, we will then search the Wiki page for the word that has the highest TF-IDF value.
## Install and import the wikipedia library
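One possible starting point is sketched below; the wikipedia package usage and the page title are assumptions for illustration, not the required solution.
!pip install wikipedia
import wikipedia

# fetch a page and take its plain text content (the title is just a placeholder)
page = wikipedia.page("Natural language processing")
text = page.content
print(text[:200])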