COMP 4446/5046 Lab 3
Today we will investigate some word representation models.
import pprint
# For parsing our XML data
from lxml import etree
# For data processing
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
# For implementing the word2vec family of algorithms
from gensim.models import Word2Vec
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Download data from Google Drive
For today's lab we will download and use the TED talk transcript data we are providing. The data is stored in Google Drive.
Option (1) Colab
Google Drive Access Setup
Running the following code should direct you to the Google Sign In page. Sign in with your own Google account by following the instructions on the page.
After that, the code should download the file.
Downloading TED Scripts from Google Drive
Click on the "Files" tab on the left side to check whether the file has downloaded successfully.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
id = '17tGzZyLbz1W3xedRhhl-j5i1ndgaM_yM'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('ted_en-20160408.xml')
id = '1709bhW6wcZx9jnypRFNPgnY-P51OEfIJ'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('ted_en-20160408.json')
Option (2) Local
Download these two files:
https://drive.google.com/file/d/17tGzZyLbz1W3xedRhhl-j5i1ndgaM_yM/view?usp=sharing
https://drive.google.com/file/d/1709bhW6wcZx9jnypRFNPgnY-P51OEfIJ/view?usp=sharing
You should not need a Google Account to do this.
Data Preprocessing
The code below prepares our data, making several simplifications that are helpful when working with smaller datasets.
Option (1) Run Code
This can sometimes take a while, so if it does not finish within 5 minutes, use the JSON file (Option 2) instead.
import re

# Parse the XML file and get the text inside the <content> tags
targetXML = open('ted_en-20160408.xml', 'r', encoding='UTF8')
target_text = etree.parse(targetXML)
parse_text = '\n'.join(target_text.xpath('//content/text()'))

# Remove "sound-effect labels" using a regular expression (regex), e.g. (Audio), (Laughter)
content_text = re.sub(r'\([^)]*\)', '', parse_text)

# Break the text into sentences using the NLTK library
sent_text = sent_tokenize(content_text)

# Remove punctuation and change all characters to lowercase
normalized_text = []
for string in sent_text:
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
    normalized_text.append(tokens)

# Tokenise each sentence to process individual words
sentences = [word_tokenize(sentence) for sentence in normalized_text]

# Print a sample of 10 (tokenised) sentences
print(sentences[:10])
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
----> sent_text = sent_tokenize(content_text)
(long stack trace through nltk/tokenize/punkt.py omitted; the cell was interrupted because sentence tokenization was taking too long)
KeyboardInterrupt:
Option (2) Use the JSON file
import json
sentences = json.load(open("ted_en-20160408.json"))
# Print a sample of 10 (tokenised) sentences
print(sentences[:10])
[['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along'], ['they', 'continued', 'doing', 'exactly', 'the', 'same']]
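Before training anything, it is worth getting a rough sense of how much data we have, since corpus size has a big effect on embedding quality (we will come back to this point later). A minimal sketch, assuming the sentences list loaded above:
# How much training data do we have?
print("number of sentences:", len(sentences))
print("number of tokens:", sum(len(sentence) for sentence in sentences))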
Word2Vec – Continuous Bag-Of-Words (CBOW)
For more details about gensim.models.word2vec, you can refer to the Gensim Word2Vec API documentation.
# Initialize and train a word2vec model with the following parameters:
# sentences: iterable of iterables, i.e. the list of lists of tokens from our data
# size: dimensionality of the word vectors (note: this parameter is renamed to vector_size in gensim >= 4.0)
# window: window size
# min_count: ignores all words with total frequency lower than the specified count value
# workers: use the specified number of worker threads to train the model (i.e., enable faster training on multicore machines)
# sg: training algorithm, 0 for CBOW, 1 for skip-gram
wv_cbow_model = Word2Vec(sentences=sentences, size=100, window=5, min_count=5, workers=2, sg=0)

# The trained word vectors are stored in a KeyedVectors instance as model.wv
# Get the 10 most similar words to 'man' by calling most_similar()
# most_similar() computes the cosine similarity between a simple mean of the vectors of the given words and the vector of every word in the model
similar_words = wv_cbow_model.wv.most_similar("man")  # topn=10 by default
pprint.pprint(similar_words)
[('woman', 0.8533000946044922),
('guy', 0.8113319277763367),
('lady', 0.7864135503768921),
('boy', 0.7814817428588867),
('girl', 0.7682338953018188),
('gentleman', 0.7450353503227234),
('soldier', 0.7398906946182251),
('kid', 0.7111135721206665),
('friend', 0.6803527474403381),
('photographer', 0.6772572994232178)]
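To see where these scores come from, we can recompute the cosine similarity between two of the trained vectors ourselves. A minimal sketch, assuming the wv_cbow_model trained above (the exact value will vary between runs, since Word2Vec training is stochastic):
import numpy as np

# Extract the trained 100-dimensional vectors for two words
vec_man = wv_cbow_model.wv["man"]
vec_woman = wv_cbow_model.wv["woman"]

# Cosine similarity = dot product divided by the product of the vector norms
cos_sim = np.dot(vec_man, vec_woman) / (np.linalg.norm(vec_man) * np.linalg.norm(vec_woman))
print(cos_sim)  # should be close to the 'woman' score reported by most_similar("man")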
Word2Vec – Skip Gram
# Now we switch to a Skip Gram model by setting parameter sg=1
wv_sg_model = Word2Vec(sentences=sentences, size=100, window=5, min_count=5, workers=2, sg=1)
similar_words = wv_sg_model.wv.most_similar("man")
pprint.pprint(similar_words)
[('woman', 0.7846474647521973),
('rabbi', 0.7115237712860107),
('soldier', 0.7067371606826782),
('guy', 0.6972269415855408),
('boy', 0.6956959962844849),
('pianist', 0.6854783296585083),
('michelangelo', 0.6789897680282593),
('testament', 0.6779104471206665),
('dancer', 0.6662606596946716),
('gentleman', 0.6645552515983582)]
Word2Vec vs FastText
The Word2Vec Skip Gram model cannot find words similar to "electrofishing", because "electrofishing" is not in its vocabulary (Word2Vec only learns vectors for words that appear at least min_count times in the training data).
similar_words = wv_sg_model.wv.most_similar("electrofishing")
pprint.pprint(similar_words)

# Note: When run, this code should produce an error.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
----> 1 similar_words = wv_sg_model.wv.most_similar("electrofishing")
      2 pprint.pprint(similar_words)
      4 # Note: When run, this code should produce an error.

/usr/local/lib/python3.8/dist-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    529                 mean.append(weight * word)
    530             else:
--> 531                 mean.append(weight * self.word_vec(word, use_norm=True))
    532                 if word in self.vocab:
    533                     all_words.add(self.vocab[word].index)

/usr/local/lib/python3.8/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    454     def get_vector(self, word):

KeyError: "word 'electrofishing' not in vocabulary"
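One way to avoid this error is to check whether a word is in the model's vocabulary before querying it. A small sketch, assuming the wv_sg_model trained above (KeyedVectors supports the in operator in both gensim 3.x and 4.x):
# Only query words the model has actually learned a vector for
for word in ["man", "electrofishing"]:
    if word in wv_sg_model.wv:
        pprint.pprint(wv_sg_model.wv.most_similar(word, topn=3))
    else:
        print(word, "is not in the Word2Vec vocabulary")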
FastText – Skip Gram
Now we will switch to FastText, which can handle out-of-vocabulary (OOV) words: it represents each word as a bag of character n-grams, so a vector for an unseen word can be built from the n-grams it contains.
from gensim.models import FastText
# We initialize and train FastText with a Skip Gram architecture (sg=1)
ft_sg_model = FastText(sentences, size=100, window=5, min_count=5, workers=2, sg=1)
# Now, try 'electrofishing' and we will see that FastText allows us to obtain word vectors for out-of-vocabulary words
result = ft_sg_model.wv.most_similar("electrofishing")
pprint.pprint(result)
[('electrolux', 0.8027057647705078),
('electro', 0.7873678803443909),
('electrolyte', 0.7839568853378296),
('electric', 0.7741890549659729),
('airbus', 0.7718110084533691),
('electroshock', 0.7662405967712402),
('gastric', 0.7549958229064941),
('electrochemical', 0.754688560962677),
('electrogram', 0.7543359994888306),
('electrons', 0.7532363533973694)]
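FastText can do this because, in addition to whole-word vectors, it learns vectors for character n-grams and assembles a vector for an unseen word from the n-grams it contains. A small sketch, assuming the ft_sg_model trained above and the gensim 3.x API used in this lab (where the vocabulary is exposed as wv.vocab):
# "electrofishing" is not a vocabulary entry...
print("electrofishing" in ft_sg_model.wv.vocab)   # expected: False

# ...but FastText can still assemble a 100-dimensional vector for it from its character n-grams
oov_vector = ft_sg_model.wv["electrofishing"]
print(oov_vector.shape)                           # expected: (100,)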
FastText – Continuous Bag-Of-Words (CBOW)
# Now we initialize and train FastText with CBOW architecture (sg=0)
ft_cbow_model = FastText(sentences, size=100, window=5, min_count=5, workers=2, sg=0)
# Again, FastText allows us to obtain word vectors for out-of-vocabulary words
result = ft_cbow_model.wv.most_similar("electrofishing")
pprint.pprint(result)
[('electric', 0.9155791997909546),
('electro', 0.9024492502212524),
('electrolux', 0.8950530290603638),
('electron', 0.8861533403396606),
('electronic', 0.8847971558570862),
('electrolyte', 0.8803984522819519),
('electroshock', 0.8720880746841431),
('electrode', 0.8711128234863281),
('electrical', 0.8702925443649292),
('electromagnet', 0.863209068775177)]
King – Man + Woman = ?
Try using both the CBOW and Skip Gram models (Word2Vec and FastText) to calculate "King – Man + Woman = ?"
# We can specify the positive/negative word list with the positive/negative parameters to create a word expression
# Top N most similar words can be specified with the topn parameter
result = wv_cbow_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
[('president', 0.7592150568962097)]
result = wv_sg_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
[('luther', 0.6705532073974609)]
result = ft_cbow_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
[('kidding', 0.89261394739151)]
result = ft_sg_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
[('pauling', 0.6854696869850159)]
This is not what we expected…
We are using relatively little data to train our embeddings, so the representations are not as good as they could be.
In the next section, we will try again with embeddings trained on a much larger dataset. Training them ourselves would be too compute-intensive, so we will use vectors that Google trained on Google News data.
Using Pretrained word embeddings with Gensim
1. Download and load the pretrained Google News Word2Vec binary file
In case you are interested, here are links to the original code for word2vec:
The Original Project
GitHub Port of the original
Another GitHub Repo with an effort to update compatibility with other systems
Option (1) Colab
# Download the pre-trained vectors trained on part of the Google News dataset (about 100 billion words)
# Beware, this file is big (3.39GB) – you might need to wait a while!
id2 = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
downloaded = drive.CreateFile({'id': id2})
downloaded.GetContentFile('GoogleNews-vectors-negative300.bin.gz')
Option (2) Local
Download the file from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing and save it locally.
You should not need a Google Account to do this.
Processing the data
# Decompress the downloaded file
!gzip -d /content/GoogleNews-vectors-negative300.bin.gz
from gensim.models import KeyedVectors
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Load the pretrained vectors into a KeyedVectors instance
# Note that we set limit=100000, i.e. a maximum number of word vectors to read from the file,
# to avoid out-of-memory issues and to load the vectors faster
filename = 'GoogleNews-vectors-negative300.bin'
gn_wv_model = KeyedVectors.load_word2vec_format(filename, binary=True, limit=100000)
# Now we can try to calculate "King – Man + Woman = ?" again
result = gn_wv_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
[('queen', 0.7118192911148071)]
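With the larger pretrained vectors, other analogy-style queries of the same form tend to work too. A quick sketch using the same most_similar() call (the exact results depend on the vectors and the limit set when loading them):
# France is to Paris as Germany is to ...?
print(gn_wv_model.most_similar(positive=['Paris', 'Germany'], negative=['France'], topn=1))

# walk is to walked as go is to ...?
print(gn_wv_model.most_similar(positive=['walked', 'go'], negative=['walk'], topn=1))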
# Let's also try to extract two example word embeddings and check their shape
wv_banana = gn_wv_model["banana"]
wv_avocado = gn_wv_model["avocado"]
print(wv_banana.shape)
print(wv_avocado.shape)
# We can also calculate the cosine similarity ourselves with the extracted words
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([wv_banana],[wv_avocado])
array([[0.5332662]], dtype=float32)
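Gensim can also compute this directly. As a cross-check (assuming the same gn_wv_model loaded above), the built-in similarity() method should give the same value as the sklearn computation:
# The same cosine similarity, computed by gensim's built-in method
print(gn_wv_model.similarity("banana", "avocado"))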
2. Load a pretrained word embedding model using the Gensim downloader API
The following code illustrates another way of loading pretrained word embeddings with Gensim. Here we try a GloVe embedding trained on Twitter data.
import gensim.downloader as api
# Download the model and return it as an object ready for use
model = api.load("glove-twitter-25")
# The similarity() function can calculate the cosine similarity between two words
print(model.similarity("cat", "dog"))
# The distance() function returns (1 - cosine similarity), which can be useful in some cases
print(model.distance("cat", "dog"))
[==================================================] 100.0% 104.8/104.8MB downloaded
0.95908207
0.040917932987213135
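The same neighbour queries work with this model too. A quick sketch (assuming the glove-twitter-25 vectors loaded above; these are only 25-dimensional and trained on tweets, so the neighbours are typically noisier than with the 300-dimensional Google News vectors):
# Nearest neighbours of "cat" in the 25-dimensional GloVe Twitter space
pprint.pprint(model.most_similar("cat", topn=5))

# distance() is just 1 - similarity(), so these two values should match
print(1 - model.similarity("cat", "dog"))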
[Tips] Play with Colab Form Fields
Colab forms support multiple types of fields, including input fields and dropdown menus.
In Lab 2 E1, we already used input fields. Let's try a few more now. You can edit this section by double-clicking it.
Let's get familiar with forms by changing the value in each input field (on the right) and checking the changes in the code (on the left), and vice versa.
#@title Example form fields
#@markdown please insert a description
string = 'examples' #@param {type: "string"}
slider_value = 117 #@param {type: "slider", min: 100, max: 200}
number = 102 #@param {type: "number"}
date = '2020-01-05' #@param {type: "date"}
pick_me = "monday" #@param ['monday', 'tuesday', 'wednesday', 'thursday']
select_or_input = "apples" #@param ["apples", "bananas", "oranges"] {allow-input: true}
# Print the output
print("string is", string)
print('slider_value', slider_value)
string is examples
slider_value 117
Extension
Word Embedding Visual Inspector (WEVI)
If you would like to visualise how Word2Vec learns, the following tool is useful: https://ronxin.github.io/wevi/