Natural Language Processing Assignment 1
Instructions
1. Due at 3 pm, February 13, 2023;
2. You may submit your answers either as a single PDF together with the code package, or as a single Jupyter notebook containing both the answers and the code;
3. For the coding questions, besides the code itself, you are encouraged to describe your code design and its workflow. A detailed analysis of the experimental results is also preferred;
4. Total marks are 100;
Question 1
(11 marks) In a language model, we could represent a word with a one-hot encoding instead of an ASCII representation. That is, we represent the word w with:
[0, 0, …, 1, …, 0, 0],
a vector of length ∥V∥ with a single 1 at the position corresponding to w, where V is our vocabulary and ∥V∥ is its size.
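As a concrete illustration (this sketch and the helper name one_hot are not part of the assignment), such a vector can be built in plain Python:

vocab = ["buy", "bought", "girl", "woman", "word", "words"]

def one_hot(word, vocab):
    # a vector of length len(vocab) with a single 1 at the index of `word`
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("girl", vocab))  # [0, 0, 1, 0, 0, 0]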
1. Suppose we have a vocabulary {buy,bought,girl,woman,word,words}, please represent each word using the one-hot encoding. (3 marks)
2. Following on from the previous part, what are the drawbacks of this encoding? Give at least two reasons. (4 marks)
3. Given a vocabulary {girl, woman, boy, man}, consider a computation unit with 4 nodes when we use the one-hot approach to encode the word “girl”. In this case, we need 4 × 4 nodes to maintain the computation unit. Is there a method that could represent the words in this vocabulary better and reduce the number of nodes to 2 × 4? Please give the solution and an explanation. (4 marks)
Question 2
(24 marks) In the field of NLP, for a text sequence S = {w_1, w_2, …, w_T}, the statistical language model factorizes the joint probability of the sequence into a product of conditional probabilities:
P(S) = P(w_1, w_2, …, w_T) = \prod_{t=1}^{T} p(w_t | w_1, w_2, …, w_{t-1})
However, this formulation is not useful in practice due to its huge computational cost. We prefer to use its simplified version, the n-gram model:
p(w_t | w_1, w_2, …, w_{t-1}) ≈ p(w_t | w_{t-n+1}, …, w_{t-1})
We usually use bi-gram (n=2) or tri-gram (n=3) to model the language.
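As a quick illustration of the n-gram idea (not part of the assignment; the helper name extract_ngrams is only used in this sketch), n-grams can be read off a tokenized sentence with a sliding window:

def extract_ngrams(tokens, n):
    # slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(extract_ngrams(tokens, 2))  # bi-grams
print(extract_ngrams(tokens, 3))  # tri-grams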
1. Given the text “the physical meanings of both the complex-valued entity and word embeddings is unknown”, please list all the bi-grams and tri-grams that contain the word “meanings” or “embeddings”. (8 marks)
2. What are the disadvantages of the n-gram model? (6 marks)
3. Distributed representation solves the problems of one-hot encodings by mapping each word of the original one-hot encoding to a shorter word vector, whose dimension can be specified during training according to the needs of the task. In PyTorch, nn.Embedding() represents the words using a trainable matrix with a given dimension. The code below defines a language model: (10 marks)
import torch.nn as nn
import torch.nn.functional as F

class LanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_size=128):
        super(LanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        # look up the context embeddings and concatenate them into one vector
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # log-probabilities over the vocabulary
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
Please train a four-gram language model (i.e., only consider the last three words before the target word) for 10 epochs on the text given in Question 2.1. Compare and analyze the differences in training loss across the embedding dimensions {32, 64, 128}.
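A minimal training sketch is given below, assuming the LanguageModeler defined above; the optimizer, learning rate, and data preparation are illustrative choices, not a required solution.

import torch
import torch.nn as nn
import torch.optim as optim

text = ("the physical meanings of both the complex-valued entity "
        "and word embeddings is unknown").split()
vocab = sorted(set(text))
word_to_ix = {w: i for i, w in enumerate(vocab)}

CONTEXT_SIZE = 3  # the last three words before the target word
# build (context, target) pairs: ([w_{t-3}, w_{t-2}, w_{t-1}], w_t)
fourgrams = [(text[i - CONTEXT_SIZE:i], text[i]) for i in range(CONTEXT_SIZE, len(text))]

for embedding_dim in (32, 64, 128):
    model = LanguageModeler(len(vocab), embedding_dim, CONTEXT_SIZE)
    loss_fn = nn.NLLLoss()                       # the model already outputs log-probabilities
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(10):
        total_loss = 0.0
        for context, target in fourgrams:
            ctx = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
            tgt = torch.tensor([word_to_ix[target]], dtype=torch.long)
            model.zero_grad()
            loss = loss_fn(model(ctx), tgt)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"embedding_dim={embedding_dim} epoch={epoch} loss={total_loss:.3f}")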
Question 3
(45 marks) Training an n-gram language model could be achieved by predicting the word given its n-gram elements. This idea is quite similar to the Continuous Bag of Words (CBOW) model. Given the text “There is an apple on the table”, CBOW predicts the word “apple” given the context of a few words before (is, an) and a few words after (on, the) the target word. In contrast, skip-gram uses the central word “apple” to predict the context words is, an, on, the.
The CBOW model is as follows. Given a target word w_i and a context window of N words on each side, w_{i-1}, …, w_{i-N} and w_{i+1}, …, w_{i+N}, and referring to all context words collectively as C, CBOW tries to minimize:
-\log p(w_i | C) = -\log \mathrm{Softmax}\left(A \sum_{w \in C} q_w + b\right),
where q_w is the embedding of word w.
1. Please use the text and LanguageModeler provided in Question 2 to train a CBOW model with a fixed window size of 2 (a possible starting point is sketched after this question). (15 marks)
2. Please use the text and Language Modeler provided in Question 2 to train a skip-gram model. (15 marks)
3. Please use the given Wikipedia corpus and the language model to train a CBOW/skip-gram model, and compare the word similarities among semantically close words, such as {woman, man, men}. (15 marks)
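A minimal CBOW training sketch (window size 2) is given below, reusing the text from Question 2.1 and the LanguageModeler with context_size = 4; the helper name make_cbow_data, the optimizer, and the learning rate are illustrative assumptions. For skip-gram, the (input, target) pairs are simply reversed: the centre word (context_size = 1) predicts each surrounding word in turn.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

text = ("the physical meanings of both the complex-valued entity "
        "and word embeddings is unknown").split()
vocab = sorted(set(text))
word_to_ix = {w: i for i, w in enumerate(vocab)}
WINDOW = 2  # fixed window of 2 words on each side

def make_cbow_data(tokens, window):
    # (2*window context words, target word); targets without a full window are skipped
    return [(tokens[i - window:i] + tokens[i + 1:i + window + 1], tokens[i])
            for i in range(window, len(tokens) - window)]

model = LanguageModeler(len(vocab), embedding_dim=64, context_size=2 * WINDOW)
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    for context, target in make_cbow_data(text, WINDOW):
        ctx = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        tgt = torch.tensor([word_to_ix[target]], dtype=torch.long)
        model.zero_grad()
        loss = loss_fn(model(ctx), tgt)
        loss.backward()
        optimizer.step()

# For Question 3.3, word similarity can then be compared via cosine similarity
# between rows of the trained embedding matrix:
emb = model.embeddings.weight
sim = F.cosine_similarity(emb[word_to_ix["word"]], emb[word_to_ix["embeddings"]], dim=0)
print(sim.item())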
Question 4
(20 marks) Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
1. Given the text: “Peter is too old to do this work, Marry is older than Peter, John is oldest one however he is still doing this work.”, the simplest way to tokenize this text is to split on spaces. However, this simple method also has some disadvantages. Please list at least two disadvantages and explain them. (5 marks)
2. Transformers models use a hybrid approach between word-level and character-level tokenization called subword tokenization. BPE (Byte-Pair Encoding) is a subword-level tokenization approach introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as space tokenization. Let us assume that after pre-tokenization, the following set of words including their frequencies has been determined:
(old, 10), (older, 5), (oldest, 8), (hug, 8), (pug, 4), (hugs, 5)
We obtain a base vocabulary:
o,l,d,e,r,s,t,h,u,g,p
Splitting all words into symbols in the base vocabulary, we obtain:
(o, l, d, 10), (o, l, d, e, r, 5), (o, l, d, e, s, t, 8), (h, u, g, 8), (p, u, g, 4), (h, u, g, s, 5)
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the above example, “o” followed by “l” is present 10 + 5 + 8 = 23 times. Thus, the first merge rule the tokenizer learns is to group all “o” symbols followed by an “l” symbol together. Next, “ol” is added to the vocabulary. The set of words then becomes:
(ol, d, 10), (ol, d, e, r, 5), (ol, d, e, s, t, 8), (h, u, g, 8), (p, u, g, 4), (h, u, g, s, 5)
This process runs iteratively. The vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter to choose. The learned merge rules are then applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). A symbol not in the base vocabulary would be represented as “[unk]”. Implement this BPE tokenizer, set the vocabulary size to 16, and train the tokenizer to complete the iterative process (a possible implementation is sketched after the word list below). Use the trained tokenizer to tokenize the words below: (15 marks)
{old, oldest, older, nug, gug, huggingface}.
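One possible implementation of the trainer and tokenizer described above is sketched below; the function names (train_bpe, merge_pair, tokenize) and the per-symbol handling of “[unk]” are design choices of this sketch rather than a prescribed interface.

from collections import Counter

def merge_pair(symbols, pair):
    # replace every adjacent occurrence of `pair` by the merged symbol
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(word_freqs, vocab_size):
    splits = {w: list(w) for w in word_freqs}           # each word split into base symbols
    vocab = sorted({c for w in word_freqs for c in w})  # base vocabulary (11 symbols here)
    merges = []
    while len(vocab) < vocab_size:                      # vocab size = base size + number of merges
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)    # most frequent adjacent pair
        merges.append(best)
        vocab.append(best[0] + best[1])
        for word in splits:
            splits[word] = merge_pair(splits[word], best)
    return vocab, merges

def tokenize(word, base_vocab, merges):
    # unknown characters become "[unk]"; the learned merges are then applied in order
    symbols = [c if c in base_vocab else "[unk]" for c in word]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

word_freqs = {"old": 10, "older": 5, "oldest": 8, "hug": 8, "pug": 4, "hugs": 5}
vocab, merges = train_bpe(word_freqs, vocab_size=16)
base_vocab = [s for s in vocab if len(s) == 1]          # the 11 single-character symbols
for w in ["old", "oldest", "older", "nug", "gug", "huggingface"]:
    print(w, tokenize(w, base_vocab, merges))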