Homework 4 Part 2
Attention-based End-to-End Speech-to-Text Deep Neural Network
11-785: Introduction to Deep Learning (Fall 2023)
OUT: November 6, 2023, at 11:59 PM ET
EARLY SUBMISSION (Canvas Quiz): November 16, 2023, at 11:59 PM ET
DUE: December 2, 2023, at 11:59 PM ET
Start Here
• Collaboration policy:
– You are expected to comply with the University Policy on Academic Integrity and Plagiarism.
– You are allowed to talk with and work with other students on homework assignments.
– You can share ideas but not code; you must submit your own code. All submitted code will be compared against all code submitted this semester and in previous semesters using MOSS.
• Submission:
You need to go through the writeup and finish the Canvas quiz by the early submission deadline.
You need to implement what is mentioned in the write-up (although you are free and encouraged to explore further) and submit your results to Kaggle. We will share a Google form or an Autolab link after the Kaggle competition ends for you to submit your code.
In this homework you will work on a sequence-to-sequence conversion problem, in which you must train models to transcribe speech recordings into word sequences, spelled out alphabetically.
As in previous HWP2s, this homework is conducted as a Kaggle competition. You can access the Kaggle competition for the homework at this link.
You can download the training and test data directly from Kaggle using their API or this link. Your download will include the following:
∗ There are two sets of training data, train-clean-360 and train-clean-100. Both include data pairs comprising sequences of audio feature vectors and their transcriptions, in the mfcc and transcripts folders respectively.
∗ The validation data will be similarly found in the dev-clean folder.
∗ The test data will be found in the test-clean folder. It will only contain an mfcc folder.
You must use the train data to train the model. Data in the dev-clean folder are to be used only for validation and not for training.
You must transcribe the utterances in the test-clean folder and write them out into submission.csv following the format already specified in the file, and upload your results to Kaggle.
You will be graded on your performance. Additional details will be posted on Piazza.
Homework objectives
If you complete this homework successfully, you would ideally have learned:
• To implement an attention-based system to solve a sequence-to-sequence problem.
– How to set up an encoder
∗ You will have an idea of the synchrony/rate principles to consider in setting up the encoder
∗ How to set up the transformer encoder
– How to set up a decoder
– How to set up attention
• You would have learned how to address some of the implementation details
– How to set up a teacher forcing schedule
– How to pad and pack the variable-length data
• To explore architectures and hyperparameters for the optimal solution
– To identify and tabulate all the various design/architecture choices, parameters and hyperparameters that affect your solution
– To effectively explore the search space and strategically find the best solution
• As a side benefit, you would also have learned how to construct a speech recognition system using the attention-based encoder-decoder architecture
Important: A note on presentation style and pseudocode
The following writeup is intended to be reasonably detailed and explains several concepts through pseudocode. However, there is a caveat: the pseudocode is intended to be illustrative, not exemplary. It conveys the concepts, but you cannot simply convert it directly to Python code. For instance, most of the provided pseudocode is in the form of functions, whereas your Python code should use classes (in fact, we’ve written the pseudocode as functions for the express reason that you cannot simply translate it to your code; function-based DL code will be terribly inefficient, so you must use classes). Please write reusable code wherever possible, and avoid redundant calculations to make your code more efficient.
Nonetheless, we hope that when you read the entire write-up, you will get a reasonable understanding of how to implement (at least the baseline architecture of) the homework.
Table of Contents
1 Introduction
  1.1 Overview
  1.2 Problem specifics
  1.3 Getting Started Early
2 Dataset
  2.1 Dataloader
3 Baseline Approach
  3.1 Model
    3.1.1 Listener (Encoder)
    3.1.2 Attention (Attend)
    3.1.3 Speller (Decoder)
  3.2 Inference
    3.2.1 Greedy Search
    3.2.2 Random Search
    3.2.3 Beam Search
    3.2.4 Levenshtein Distance
  3.3 Training
    3.3.1 Teacher Forcing
    3.3.2 Cross-Entropy Loss with Padding
    3.3.3 Overall training
4 Practical Implementation Details
5 Conclusion
Appendices
Here is a checklist page that you can use to keep track of your progress as you go through the write-up and implement the corresponding sections in your starter notebook. As you complete your unit tests for each component to verify your progress, you can check the corresponding boxes aligned with each section and go through this writeup and starter notebook simultaneously step by step.
1. Read Introduction
2. Read the writeup, and finish the Canvas quiz
3. Dataset and Dataloader
Dataset description
Refer to HW0 and implement Dataset class
4. Encoder
Listener description
Implement transformer encoder
Implement Encoder
5. Attention
Attention description
Implement Attention computation
6. Decoder
Speller description
Implement Decoder
7. Training
Training description
Implement Teacher Forcing
Cross-entropy Loss with padding
8. Inference
Implement Greedy Decoding
Implement Random Decoding
Implement Beam Decoding
Implement Levenshtein Distance (also done in HW3P2)
1 Introduction
Key new concepts: Sequence-to-sequence conversion, Encoder-decoder architectures, Attention.
Mandatory implementation: Encoder-decoder architecture, Attention. Your solution must implement an encoder-decoder architecture with attention. While it may be possible to solve the homework using other architectures, you will not get marks for implementing those.
Restrictions: You may not use any data besides that provided as part of this homework. You may not use the validation data for training. You cannot use an implementation from PyTorch or any other third-party package for implementing any type of Attention (and optionally beam search).
1.1 Overview
In this homework you will learn to build neural networks for sequence-to-sequence conversion or transduction.
“Sequence-to-sequence” conversion refers to problems where a sequence of inputs goes into the system, which emits a sequence of outputs in response. The need for such conversion arises in many problems, e.g.
• Speech recognition: A sequence of speech feature vectors is input. The output is a sequence of charac- ters writing out what was spoken.
• Machine translation: A sequence of words in one language is input. The output is a sequence of words in a different language.
• Dialog systems: A user’s input goes in. The output is the system’s response.
For this homework, we will work on the speech recognition problem. Unlike HW1P2 and HW3P2, where the task was to predict phonemes in the speech, in this homework, we will learn to directly spell the spoken sentence out in the English alphabet.
A key characteristic of speech recognition problems is that there may be no obvious correspondence between the input and the output sequences. For instance, in a dialog system, a user input “my screen is blank” may elicit the output “check the power switch”. In a speech recognition system the input may be the (feature vector sequence for the) spoken recording for the phrase “know how”. The system output must be the character sequence “k”, “n”, “o”, “w”, “ ”, “h”, “o”, “w” (note the blank space “ ” between know and how – the output is supposed to be readable, and must include blank spaces as characters). There is no obvious audio corresponding to the silent characters “k” and “w” in know, or the blank space character between know and how, yet these characters must nevertheless be output.
As a consequence of the absence of correspondence between the input and output sequences, the usual “vertical” architectures that we have used so far in our prior homeworks (Figure 1a), where the input is sequentially processed by layers of neurons until the output is finally computed, can no longer be used.
Instead, we must use a two-component model, comprising two separate network blocks, one to process the input, and another to compute the output. Such architectures are called encoder-decoder architectures. The encoder processes the input sequence to compute feature representations (embeddings) of the input sequence. The decoder subsequently uses the input representations computed by the encoder to sequentially compute the output de novo, i.e. from scratch (Figure 1b).
The most important part of the model is the attention mechanism. Although the decoder’s output may not have a direct correspondence with the input, each individual output symbol (character) nonetheless relates more to some parts of the input than others, e.g. the blank character “ ” in our above speech recognition example corresponds primarily to the audio at the boundary between the words “know” and “how” in the recording. When outputting the “ ”, the decoder must somehow know to “focus” on, or “pay attention” to this region of the input.
In this homework you will learn how to build an encoder to effectively extract features from a speech signal, how to construct a decoder that can sequentially spell out the transcription of the audio, and how to implement an attention mechanism between the decoder and the encoder.
(a) Vertical architecture (b) Encoder-decoder architecture
Figure 1: (a) Standard “vertical” architectures we have seen in previous homeworks. The input is sequentially passed through several layers until the output is computed from the final layer. This kind of architecture is not useful for sequence-to-sequence problems with no input-output correspondence, such as the problem we deal with in this homework. (b) General encoder-decoder architecture, comprising two separate modules, an encoder to process the input and a decoder to generate the output. The two are linked through an attention module.
Although the specifics of the homework will be targeted towards speech recognition, please note that the general framework (with appropriate modification of the encoder and decoder architectures) should also apply to other types of sequence-to-sequence problems, or even to image or video captioning (you would only need to modify the encoder to derive features from images or video respectively).
1.2 Problem specifics
In this problem of speech recognition you will learn how to convert a sequence of feature vectors of speech to a sequence of characters that spells out its transcription. For this, you will have to implement the following:
• An encoder that can derive high-level feature vectors from the input sequence of speech vectors. In this homework, the encoder derives a sequence of such features from the input.
• A decoder to produce the actual output sequence. The decoder for this homework is a conditional language model that depends on the features extracted by the encoder (i.e., it is conditioned on the input).
• Attention. To generate each output, the decoder receives, as input, a “context”, comprising a weighted sum of the sequence of representation vectors computed by the encoder. Attention is the manner in which the decoder derives information from the input for conditioning the language model. You will explore ways of computing the attention weights such that they automatically learn to pay “attention” to the most relevant parts of the input for each output.
• Training. And last, but not the least you will learn how to train a complicated model that has all of the above components to do your bidding. One of the consequences of the probabilistic nature of decoding is asynchrony between the actual and target outputs of the decoder. You will also explore ways of dealing with this through a mechanism called “teacher forcing”, and the effect of various optimization strategies and schedules on your models.
We provide you a baseline architecture based on “Listen, Attend and Spell”, William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals and “Attention Is All You Need”, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (which we highly recommend that you read), and describe the key components of it later in Section 4. But you are also welcome to attempt other variations (for any of the three components – encoder, decoder or attention) provided they fall within the Encoder-Decoder+Attention framework.
Believe it or not, although this may seem like a lot, you will manage to get all this done in the course of this homework. In the following sections, we will outline how to complete the work.
1.3 Getting Started Early
Uniquely, HW4P2 weaves together your work from HW3P2 and HW4P1. As a result, it can, and most likely will take a significant amount of your time compared to previous homeworks, so please start early. Not only is implementing this HW harder than previous ones (with more moving parts, you are more likely to encounter errors, which will take significant time to identify and debug), the models also typically take longer to converge and produce good results.
The rest of this writeup is structured as follows – Section 3 describes the dataset and dataloader. In Section 4 we outline the baseline model, including the individual components (encoder, decoder and attention), how to perform inference/a forward pass with it, and how to train it. Section A in the Appendices outlines debugging and specific implementation details.
2 An important note on Attention
Since the key feature of this homework is the attention mechanism, it is useful to first explain it briefly. You will need this information for your homework. You have already encountered attention in Lecture 18. We assume that you have an idea of attention, and the concepts behind it. If not, we recommend that you review lectures 18 and 19 and recitation 10. Assuming this prior knowledge, we quickly recap the key concepts of attention that apply to this homework problem before we continue.
2.1 An attention head
The encoder in an encoder-decoder model takes in an input sequence of vectors X0, · · · , XN−1 and produces a sequence of embeddings, E0, · · · , EM−1. The length of the embedding sequence M need not be the same as that of the input sequence, N, but each embedding vector Ei effectively represents a section of the input. Assuming all the embeddings to be row vectors (in the standard python format), the set of embedding vectors can be vertically stacked into a matrix E.
In the typical implementation of attention, from the embeddings Ei the encoder computes two terms, a key Ki and a value Vi. Typically these are computed through a linear transform. Stacking all the key vectors (assumed to be row vectors) vertically into a matrix K, and the value vectors into a matrix V, these can be computed as V = EWV and K = EWK , where WV and WK are learnable parameters.
In the decoder, corresponding to each output Ok, we first compute a query vector qk, which is used to “query” the encoder outputs to determine which portion of the input to pay most attention to in order to generate Ok. When the decoder is a recurrent network, the query is typically computed as a linear transform of the recurrent state sk−1, and optionally also Ok−1: qk = sk−1WQ, or qk = [sk−1, Ok−1]WQ, where WQ is a learnable parameter, Ok−1 is the (embedding or one-hot representation of the) (k − 1)th output, and [sk−1, Ok−1] represents the concatenation of sk−1 and Ok−1.
In order to produce the kth output symbol Ok, the decoder computes a context Ck from the encoder- derived value vectors: Ck = wkV. wk is a vector of attention weights for the kth output, computed as wk = softmax(qkK⊤), and has the property that (a) it has as many components as the number of vectors in E, (b) all of its components are non-negative, and (c) they sum to 1.0. Ideally, the components of wk corresponding to Value vectors representing the region of the input most relevant to Ok will be high, and the rest will be low. The context Ck is input to the decoder (along with any other recurrent state values and previous outputs that your architecture considers) to generate Ok.
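To make the computation above concrete, here is a minimal NumPy sketch of a single attention head for one decoder step. It uses the scaled variant of the dot product; the array names (E, W_K, W_V, W_Q, s_prev) are illustrative, not from the starter code.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract max for numerical stability
    return e / e.sum()

def attention_step(E, s_prev, W_K, W_V, W_Q):
    """One decoder step of single-head attention.
    E:      (M, d_e)  encoder embeddings, one row per region of the input
    s_prev: (d_s,)    previous decoder state
    Returns the context vector c_k (d_v,) and the attention weights w_k (M,)."""
    K = E @ W_K                                   # keys,   (M, d_k)
    V = E @ W_V                                   # values, (M, d_v)
    q = s_prev @ W_Q                              # query,  (d_k,)
    w = softmax(q @ K.T / np.sqrt(K.shape[1]))    # attention weights, sum to 1
    c = w @ V                                     # weighted sum of values = context
    return c, w

# toy usage with random parameters
rng = np.random.default_rng(0)
M, d_e, d_s, d_k, d_v = 7, 16, 16, 8, 8
c, w = attention_step(rng.normal(size=(M, d_e)), rng.normal(size=d_s),
                      rng.normal(size=(d_e, d_k)), rng.normal(size=(d_e, d_v)),
                      rng.normal(size=(d_s, d_k)))
assert np.isclose(w.sum(), 1.0)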
2.2 Multi-head attention
The attention computed using the above method (Section 2.1) computes a single context vector Ck to compute output Ok. Ck derives and focuses on a single specific aspect of the input that is most relevant to
Ok. It is reasonable to assume that there are, in fact, multiple separate aspects of the input that must all be considered to compute Ok.
Multi-head attention works on this principle to derive multiple context vectors Ck1, Ck2, · · · , CkL. Each context vector Ckl is computed using a separate attention module of the form explained above (Section 2.1), with its own learnable parameters WKl, WVl, and WQl. The complete module that computes an individual context vector Ckl is called an attention head, thus the overall attention framework that computes multiple context vectors is referred to as multi-head attention.
The final context vector Ck that is used to compute Ok is obtained by concatenating the individual context vectors: Ck = [Ck1, Ck2, · · · , CkL] (where L is the number of attention heads).
In this homework you will also investigate the relative advantages of multi-head attention over simple atten- tion and the benefits to be obtained through increasing the number of attention heads.
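A multi-head version simply runs L independent copies of the single-head computation, each with its own projection matrices, and concatenates the resulting per-head contexts. The NumPy sketch below is illustrative only; it assumes the per-head projections are passed in as lists.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multihead_context(E, s_prev, W_Ks, W_Vs, W_Qs):
    """W_Ks, W_Vs, W_Qs: lists of per-head projection matrices (length L).
    Returns the concatenated context [C_k1, ..., C_kL] for one output step."""
    contexts = []
    for W_K, W_V, W_Q in zip(W_Ks, W_Vs, W_Qs):
        K, V, q = E @ W_K, E @ W_V, s_prev @ W_Q
        w = softmax(q @ K.T / np.sqrt(K.shape[1]))   # per-head attention weights
        contexts.append(w @ V)                        # per-head context C_kl
    return np.concatenate(contexts)                   # final context C_k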
2.2.1 Transformer Encoder
The transformer encoder contains self-attention layers. In a self-attention layer, all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. The encoder consists of a stack of identical layers (6 in the original paper, Attention Is All You Need), where each layer is composed of two sublayers.
1. The first sublayer implements a multi-head self-attention mechanism as explained in Lecture 19. The multi-head mechanism implements heads that receive a (different) linearly projected version of the queries, keys, and values, each to produce outputs in parallel that are then used to generate a final result.
2. The second sublayer is a fully connected feed-forward network consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between.
FFN(x) = ReLU(xW1 + b1)W2 + b2
The layers of the Transformer encoder apply the same linear transformations to all the positions in the input sequence, but each layer employs different weight (W1, W2) and bias (b1, b2) parameters to do so.
Furthermore, each of these two sublayers has a residual connection around it.
Each sublayer is also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):
Layernorm(x + sublayer(x))
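To illustrate this sublayer wiring (self-attention and feed-forward, each wrapped in a residual connection followed by layer normalization), here is a hedged PyTorch sketch of one encoder layer. The self-attention module is left as a constructor argument, standing in for your own multi-head implementation (recall that you may not use PyTorch’s built-in attention); all dimensions are illustrative.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, d_ff, self_attn):
        super().__init__()
        self.self_attn = self_attn              # your own multi-head self-attention module
        self.ffn = nn.Sequential(               # two linear maps with a ReLU in between
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask=None):
        # sublayer 1: multi-head self-attention + residual + layernorm
        x = self.norm1(x + self.self_attn(x, pad_mask))
        # sublayer 2: position-wise feed-forward + residual + layernorm
        x = self.norm2(x + self.ffn(x))
        return x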
The Transformer architecture cannot inherently capture any information about the relative positions of the elements in the sequence since it does not make use of recurrence. This information has to be injected by introducing positional encodings to the input embeddings.
2.2.2 Position Encoding
Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are many reasons why a single number, such as the index value, is not used to represent an item’s position in transformer models. For long sequences, the indices can grow large in magnitude. If you normalize the index value to lie between 0 and 1, it can create problems for variable-length sequences, as they would be normalized differently. There are several requirements for the positional encodings: they should have some representation of time, they should be unique for each position, and they should be bounded.
In the original transformer paper, they use a smart positional encoding scheme, where each position/index is mapped to a vector. Hence, the output of the positional encoding layer is a matrix, where each row of the matrix represents an encoded object of the sequence summed with its positional information.
Figure 2: Adapted from machinelearningmastery
2.2.3 Positional Encoding Layer in Transformers
Suppose you have an input sequence of length L and require the position of the kth object within this sequence. The positional encoding is given by sine and cosine functions of varying frequencies:
P(k, 2i) = sin(k / n^(2i/d))
P(k, 2i+1) = cos(k / n^(2i/d))
k: position of an object in the input sequence, 0 ≤ k < L
d: dimension of the output embedding
P(k, j): position function for mapping a position k in the input sequence to index (k, j) of the positional matrix
n: user-defined scalar, set to 10,000 by the authors of Attention Is All You Need
i: used for mapping to column indices, 0 ≤ i < d/2, with a single value of i mapping to both the sine and the cosine function
In the above expressions, you can see that even indices of the encoding correspond to the sine function and odd indices correspond to the cosine function.
Listing 1 Positional Encoding
function P = PositionEncoding(seq_len, d, n)
    P = # initialize a (seq_len, d) matrix to hold the positional encoding
    for k = 0:seq_len-1
        for i = 0:d/2-1
            denominator = # compute the denominator n^(2i/d)
            P[k, 2i] = # compute the positional encoding using sin
            P[k, 2i+1] = # compute the positional encoding using cos
    return P
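As a reference point, a direct NumPy translation of Listing 1 might look like the following; this is one possible filled-in version, not the official solution.

import numpy as np

def position_encoding(seq_len, d, n=10000):
    """Sinusoidal positional encoding:
    P[k, 2i] = sin(k / n^(2i/d)), P[k, 2i+1] = cos(k / n^(2i/d))."""
    P = np.zeros((seq_len, d))
    for k in range(seq_len):
        for i in range(d // 2):
            denominator = n ** (2 * i / d)
            P[k, 2 * i] = np.sin(k / denominator)
            P[k, 2 * i + 1] = np.cos(k / denominator)
    return P

# e.g. P = position_encoding(seq_len=50, d=256) has shape (50, 256),
# and each row is added to the corresponding input embedding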
3 Dataset
You will be working on a similar dataset as in HW1P2 and HW3P2. The training set will comprise input-output pairs, where each input consists of a sequence of 28-dimensional mel-spectral vectors derived from a spoken recording, and the corresponding output consists of the spelled-out text transcription. The transcriptions are in terms of 31 characters: the 26 letters of the English alphabet, the blank space, the comma “,”, and three special tokens (listed in the starter notebook).
The audio recordings in the different training instances are of different lengths, and the output sequences too are (obviously) of different lengths. We also provide you a Python array VOCABULARY (refer to the starter notebook) which contains a mapping from the letters to numeric indices. You must use this list to convert letters from transcriptions to their indices for the purposes of the competition.
3.1 Dataloader
You will need to implement the Dataset class for this HW on your own. You can refer to the dataset implementation in HW3P2 to help you here (they are very similar). We also recommend using cepstral mean normalization to normalize the input cepstral vector sequences for this HW, as provided in HW1P2 (although this is not mandatory). Although no output labels are provided for the test data during inference, we still have to worry about length – when test data are batched, the outputs for the different inputs in a batch of test data will all also be of different lengths. We recommend using a maximum length of 550 for generating your outputs for test and validation data, although you are free to experiment with it. More details on how to handle variable data length are in Appendix D.
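If it helps you get started, here is a rough sketch of a Dataset and collate function. It assumes the mfcc and transcripts folders contain one .npy file per utterance, that a vocab_map dictionary maps characters to indices, and that the pad index is 0 – check the starter notebook for the actual file layout, VOCABULARY contents, and special tokens before relying on any of this.

import os
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence

class SpeechDataset(torch.utils.data.Dataset):
    def __init__(self, root, vocab_map, partition="train-clean-100"):
        self.mfcc_dir = os.path.join(root, partition, "mfcc")
        self.trans_dir = os.path.join(root, partition, "transcripts")
        self.files = sorted(os.listdir(self.mfcc_dir))
        self.vocab_map = vocab_map                                   # char -> index

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        x = np.load(os.path.join(self.mfcc_dir, self.files[idx]))    # (T, 28) features
        x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)            # cepstral mean (and variance) normalization
        y = np.load(os.path.join(self.trans_dir, self.files[idx]))   # sequence of characters
        y = np.array([self.vocab_map[c] for c in y])                 # characters -> indices
        return torch.as_tensor(x, dtype=torch.float32), torch.as_tensor(y, dtype=torch.long)

def collate_fn(batch):
    xs, ys = zip(*batch)
    x_lens = torch.tensor([len(x) for x in xs])
    y_lens = torch.tensor([len(y) for y in ys])
    xs = pad_sequence(xs, batch_first=True)                          # (B, T_max, 28)
    ys = pad_sequence(ys, batch_first=True)                          # (B, L_max), pad index 0 assumed
    return xs, ys, x_lens, y_lens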
4 Baseline Approach
The baseline model for this assignment is derived from the paper “Listen, Attend and Spell” (LAS). However, we shall replace the encoder in LAS with the transformer encoder as described in the paper Attention Is All You Need. The LAS paper describes an encoder-decoder approach: The encoder in this model is called a Listener since it “listens” to the audio, the decoder, which spells out the transcription, is called the Speller, and Attend of course refers to the attention module. The objective of this homework is to learn all components, jointly.
A brief summary of the components of LAS are as follows:
• The Listener consists of a transformer encoder structure that takes in the audio feature vector sequences of the given utterances and outputs a sequence of high level representation vectors.
• The Speller takes in the high-level feature output from the Listener network and uses it to compute a probability distribution over sequences of characters using the attention mechanism.
• Attention can be intuitively understood as trying to learn a mapping from a character embedding to some regions of the utterance. The Listener produces a high-level representation of the given utterance, and the Speller uses parts of that representation, selected through Attention, to predict the next character in the sequence.
For this HW, we elaborate on a variant of LAS you can implement as your model because we believe that the LAS paper has some unclear explanations. Hence, we have explained and provided additional information for your learning. We do recommend following this approach for the HW, as it is in line with the starter code template provided. However, you are free to develop your own model, or write a notebook from scratch.
We outline the model components, inference and training procedures below.
4.1 Model
The model has three components – the Listener (encoder), the attention module, and the Speller (decoder).
4.1.1 Listener (Encoder)
The Listener in this model extracts meaningful latent representations from the speech, which are used by the decoder to predict characters in the English alphabet.
The encoder takes in an input sequence of vectors X0, · · · , XN−1 and produces a sequence of latent repre- sentations (or embeddings), E0, · · · , EM−1. The length of the embedding sequence M need not be the same as that of the input sequence, N, but each embedding vector Ei effectively represents a section of the input.
In essence, you are constructing an acoustic model: the encoder takes into consideration the nature of the input feature vector sequence and the rate difference between the input and output of the speech recognizer.
The input sequence of speech feature vectors exhibits both short-term structural relations and long-term contextual dependence. The listener typically tries to capture these dependencies. There is also an approximate factor of 8 difference between the rate of the feature vector sequences in the audio (100 vectors per second) and their spelling (on average 12-15 characters per second). So the listener includes components that effectively lower the rate of the latent representations, typically by a factor of 8, to match the output.
4.1.1.2 A typical listener
Here is pseudocode for a typical complete Listener that reduces the rate by a factor of 2:
Listing 2 Listener
# X = (batchsize, length, dim) array, which is a minibatch of input sequences
# The output is a minibatch of embedding sequences (for the entire input minibatch).
function E = Listener(X)
    O1 = BiLSTM(X, BLSTMwidth, BLSTMparams)
    O2 = 1DCNN(O1, stride=2, CNNparams)   # add 1D CNNs to downsample
    E = TransformerEncoder(O2, num_heads, dimension)
    return E
This is only an example; you must devise your own architecture. Use what you have learnt about the transformer to implement your Listener.
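For instance, a PyTorch version of such a Listener could be wired up roughly as below. The layer widths, the downsampling factor (stack more strided convolutions for /4 or /8), and the stack of transformer encoder layers (built from your own layers, e.g. an encoder layer like the sketch in Section 2.2.1) are all assumptions you should experiment with; for clarity the sketch also omits packing of padded sequences, which you will want in practice.

import torch
import torch.nn as nn

class Listener(nn.Module):
    def __init__(self, input_dim=28, hidden_dim=256, encoder_layers=None):
        super().__init__()
        # short-term and long-term context over the raw feature frames
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        # strided 1D convolution halves the time resolution
        self.downsample = nn.Conv1d(2 * hidden_dim, 2 * hidden_dim, kernel_size=3,
                                    stride=2, padding=1)
        # a stack of your own transformer encoder layers (self-attention + FFN)
        self.encoder_layers = nn.ModuleList(encoder_layers or [])

    def forward(self, x):
        # x: (B, T, input_dim) padded batch of feature sequences
        out, _ = self.blstm(x)                      # (B, T, 2*hidden_dim)
        out = self.downsample(out.transpose(1, 2))  # (B, 2*hidden_dim, T/2)
        out = out.transpose(1, 2)                   # (B, T/2, 2*hidden_dim)
        for layer in self.encoder_layers:
            out = layer(out)
        return out                                  # embeddings E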
4.1.2 Attention (Attend)
The key feature of this homework (and the LAS model in general) is the attention mechanism.
Figure 3: Architecture: Here the Listener (Encoder) is the module that creates the input embeddings. The Attender applies attention over these embeddings. Finally, the Speller (Decoder) spells out the output sequence.
Conceptually, attention works as follows. Each of the encoder-derived embeddings E0,··· ,EM−1 effectively represents a section of the input. As mentioned in Section 1, each symbol output by the decoder refers to a different portion of the input. Information on which portion of the input to refer (or pay attention) to, and what features to derive from it must both be decided from the encoder-derived embeddings. Both of these are typically derived through linear functions of the embeddings.
In a typical implementation of attention, from the embeddings Ei the attention module computes two terms, a key Ki and a value Vi, through linear transforms: Ki = EiWK, and Vi = EiWV (assuming the Ei to be row vectors), where WK and WV are learnable weights corresponding to the keys and values. The key Ki is used to compute the relevance of Ei to any output, and the value Vi is the actual feature that Ei contributes to it.
For each character embedding Dk, the attention module computes a query qk. The module uses the query qk and the keys Kj to compute a set of attention weights wkj. Attention weights are strictly non-negative and must sum to 1.0. They are typically computed by a softmax applied to a normalized inner product of the queries and the keys. Ideally, the wkj values for embeddings Ej corresponding to inputs that are most relevant to Dk should be high, while the others are low. This behavior is expected to emerge automatically through training.
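Putting the above into module form, a single-head Attend block could look roughly like the sketch below: key/value/query projections implemented as linear layers, a scaled dot product, and an optional mask so that padded encoder positions receive zero weight. All names and dimensions are illustrative, not the starter-code interface.

import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, enc_dim, dec_dim, proj_dim):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, proj_dim)    # plays the role of W_K
        self.value_proj = nn.Linear(enc_dim, proj_dim)  # plays the role of W_V
        self.query_proj = nn.Linear(dec_dim, proj_dim)  # plays the role of W_Q
        self.scale = proj_dim ** -0.5

    def forward(self, enc_out, dec_state, enc_mask=None):
        # enc_out: (B, M, enc_dim) encoder embeddings, dec_state: (B, dec_dim)
        K = self.key_proj(enc_out)                      # (B, M, proj_dim)
        V = self.value_proj(enc_out)                    # (B, M, proj_dim)
        q = self.query_proj(dec_state).unsqueeze(1)     # (B, 1, proj_dim)
        scores = torch.bmm(q, K.transpose(1, 2)).squeeze(1) * self.scale  # (B, M)
        if enc_mask is not None:                        # True where the position is padding
            scores = scores.masked_fill(enc_mask, float("-inf"))
        w = torch.softmax(scores, dim=-1)               # attention weights, sum to 1
        context = torch.bmm(w.unsqueeze(1), V).squeeze(1)  # (B, proj_dim) context vector
        return context, w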