Data and Tasks jars for Sequence Classification — Recurrent Neural Networks (RNNs)
--
In the last article, we touched upon the basics of RNNs and discussed how RNNs inherently cover all the desired properties of an ideal network for addressing sequence-based problems.
In this article, the Data and Tasks jars (of the 6 jars of Machine Learning) specific to Recurrent Neural Networks are discussed.
Data and Tasks
RNNs are typically used for 3 types of tasks:
Sequence Classification:
- Here the complete sequence is ingested as the input
- And the “model produces one output at the end” — for example, whether the sequence conveys positive/negative sentiment, or whether a video-based sequence represents a specific class (example: the “Surya namaskar” pose)
- Here the “input sequence” might have “n tokens/words/video frames” but the “model produces one output”
Sequence Labeling:
- For “every word” in the sequence, the “idea is to attach a label” to that word — say a part-of-speech tag (whether the word is a noun, verb, adjective, and so on) or a named-entity tag
- Here the “output is produced for each word in the input” sequence — basically, for “n” words in the input sequence, the model will produce “n” outputs
Named entity recognition:
This basically refers to “assigning a label (NE or not) to each word in the sequence” depending on whether it refers to “people names, location names, organization names”. Sometimes dates, numbers, and so on are also considered named entities, depending on the application.
So, for each word, the model would provide a label indicating whether it is an NE (named entity) or not. For example, let’s say the “input” is: “Ram went to Delhi yesterday”; the named entities, in this case, are listed with the “NE” label in the snippet below
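To make the labeling concrete, here is a minimal sketch (the tag names and the choice to leave “yesterday” untagged are illustrative, not from the original snippet):

```python
# Hypothetical token/label pairs for "Ram went to Delhi yesterday".
# "NE" marks a named entity, "O" marks everything else; "yesterday" could
# also be tagged if dates were treated as named entities in the application.
sentence = ["Ram", "went", "to", "Delhi", "yesterday"]
labels   = ["NE",  "O",    "O",  "NE",    "O"]

for word, label in zip(sentence, labels):
    print(f"{word:>10} -> {label}")
```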
Sequence Generation:
Machine Translation:
- Given an input in one language, the idea is to translate it to another language
- Say the “input sequence could be of length n”, the “output in itself would be a sequence of m words”
- Here, each word in the input could lead to more than 1 output; and similarly, more than one input might produce a single output
Say the input (4 words) is:
Its corresponding output (translated to the Hindi language) would be:
The input is a sequence and the output is also a sequence, and the two may be of different lengths
These are the typical tasks tackled using a Recurrent Neural Network. Let’s move on to the Data jar.
Data in case of sequence classification
Say the input is a bunch of sentences for the task of sentiment classification (it could be any other task of a similar nature); this is what the data might look like in tabular format
Here “x1, x2, x3” represent the first word, second word, third word, and so on, up to the maximum number of words (considering all sentences that form the input), and the “correct label” is also available as ‘y’
The snippet below clearly illustrates that the “input would be of varying length” — for example, there is no 8th word in the first sentence, there are only 6 words in the third sentence/input, and so on
And the second thing is that the “words need to be converted to numbers” as the neural network would take numeric input
So, it forms a supervised learning setup with a dataset consisting of “(x, y)” pairs, where “x” is the sequence of words “{x1, x2, x3, ……., xn}” in the sentence and “y” is the correct label
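A minimal sketch of such a dataset (the sentences and labels below are made up purely for illustration):

```python
# Each example is a pair (x, y): x is a sequence of words, y is the label
# (1 = positive sentiment, 0 = negative sentiment).
dataset = [
    (["what", "a", "great", "movie"], 1),
    (["the", "plot", "made", "no", "sense", "at", "all"], 0),
    (["great", "movie"], 1),
]

for x, y in dataset:
    print(f"x = {x}, y = {y}")  # note: the sequences have different lengths
```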
The “first step” is to represent the data in matrix form, where each word in each row is represented as a one-hot encoded vector, but the number of elements in every row of the matrix (each row corresponds to one sentence) is actually different, for example:
- the first sentence in this case has a total of 7 words — each of these words could be represented as a one-hot encoded vector — and these form the first row of the data matrix
- the second sentence seems to have a total of 8 words, so essentially it will have a total of 8 one-hot encoded vectors
- And similarly, each word in each row/input is converted to one-hot encoded vector form — and since the number of words would vary across input sentences, that means the dimension of each row is going to be different
The idea is to “create a data matrix X” of dimension, say, “m x n” so that each of the “m” rows has the same number of features “n” instead of a different dimension — and this is something that is taken care of in the data pre-processing module
And the output is still fine in this case: there are “m output labels” (they are 0, 1 in this case, but it could have been a multi-class problem as well — say the sentence is “good”, “very good”, “bad”, “very bad”)
Data Pre-processing
A bunch of things needs to be covered as part of data pre-processing before feeding the data to the model, and these are standard for all Natural Language Processing (NLP) problems and all sequence problems:
Special symbols are defined:
- “<sos>” stands for the “start of the sequence” — it is an artificial word inserted at the beginning of a sentence, and it conveys the start of the sequence
- Similarly, the “<eos>” represents the “end of the sequence”
- Having the explicit special characters for “start of the sequence, end of the sequence” helps in better learning of the machine learning model
- Typically, the “special character” for the “end of a sequence” is “inserted after the actual last character” (even if the last character is a “.”) — the reason is that some sentences might end with an exclamation mark or a question mark, or, if the data is noisy, the ending character might be missing altogether (a short sketch of inserting these symbols follows)
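A minimal sketch of wrapping a tokenized sentence with these special symbols (the function name is just illustrative):

```python
SOS, EOS = "<sos>", "<eos>"

def add_special_symbols(tokens):
    # Insert the artificial start/end-of-sequence words around the sentence.
    return [SOS] + tokens + [EOS]

print(add_special_symbols(["great", "movie"]))
# ['<sos>', 'great', 'movie', '<eos>']
```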
Padding:
- The “maximum input length across all sequences (input sentences) is computed” and a special word/character (that just represents “padding”) is inserted at the end of all the shorter sequences so that all the sentences/sequences are of the same length
- Say the “max. input length is 10”; now there would be sentences with a length of less than 10, and for such cases the “padding-specific character” is added so that every sentence becomes a sequence of length 10 — essentially this operation ensures all the sentences (in the data matrix) are of the same length (a short sketch follows)
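A minimal sketch of this padding step, assuming a “<pad>” token and the illustrative sentences below:

```python
PAD = "<pad>"

def pad_sequences(sequences):
    # Append <pad> tokens so that every sequence reaches the maximum length.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = [
    ["<sos>", "great", "movie", "<eos>"],
    ["<sos>", "what", "a", "great", "movie", "<eos>"],
]
for seq in pad_sequences(batch):
    print(seq)  # both rows now have the same length
```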
One hot encoding:
- Post incorporating the special characters (“<sos>, <eos>, <padding>”), one-hot encoded vectors for all the words are created
- A unique index is defined for every unique word that appears in the training data (a 1:1 mapping between index and word); say the first three indexes are reserved for the special words, and then every new word that is encountered is inserted into the running list/table of words (if a word is repeated across sentences, it is inserted only once in the table) — a short sketch of building this table follows
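A minimal sketch of building such a word-to-index table, with the first indexes reserved for the special symbols (the sentences are illustrative):

```python
def build_vocab(sequences):
    # Reserve the first indexes for the special words, then assign a new
    # index to every previously unseen word (repeated words are added once).
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2}
    for seq in sequences:
        for word in seq:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

sequences = [
    ["<sos>", "great", "movie", "<eos>", "<pad>", "<pad>"],
    ["<sos>", "what", "a", "great", "movie", "<eos>"],
]
print(build_vocab(sequences))
```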
The complete list of operations is listed below:
These operations help with two things at large:
- They ensure all the rows in the training data have the same number of features/words, via adding special characters (<sos>, <eos>, <pad>)
- And they provide a way to represent these words in numeric form (using one-hot encoding)
Let’s take one example to cover these operations in detail:
Say the data consists of two rows, and all data-processing (adding special characters) is taken care of:
A dictionary with all the unique words (in training data) and special symbols is defined, and based on that one-hot encoded vectors are generated for all the words in all rows:
This is what the one-hot encoded vector matrix for the input data would look like:
In the first row, the first word would actually be the “start of sequence” character, so the very first index has been assigned a value 1, and the value at all other indexes is 0
Similarly, one-hot encoded vectors are defined for all other words in the input sentence
Say there are a total of 24 unique words (including special characters) in this example; therefore, each word would be represented using a 24-dimensional vector, and since there are a total of 10 words in each sentence (after the pre-processing), the dimension of each row in the input data would be 24*10 = 240
So, if there are ‘m’ input rows, the input matrix would be of size ‘m x 240’
TensorFlow and PyTorch might store this information in a more compact manner — for example, for each word they might store just the index where the value is 1 in the one-hot encoded vector (and this would point to the actual one-hot encoded vector under the hood) instead of the complete one-hot encoded vector for each word:
- For example, let’s say there are just two words, each with a dimension of 5, represented as [00100] and [01000] respectively;
- Now PyTorch or TensorFlow might store this as [2 1] — here 2 corresponds to the index where the value is 1 (in the one-hot encoded vector) for the first word, and 1 corresponds to the index where the value is 1 (in the one-hot encoded vector) for the second word
So, in this case, the matrix dimension would be “m X L”, where “L” is the maximum sequence length (2 in the two-word example above, or 10 in the earlier padded example), rather than the flattened one-hot dimension
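A minimal NumPy sketch contrasting the two representations for a single sentence, assuming a 24-word vocabulary and 10 time steps as in the example above (the word indices are made up):

```python
import numpy as np

vocab_size, seq_len = 24, 10                      # 24 unique words, 10 words per padded sentence
word_indices = [1, 5, 17, 9, 2, 0, 0, 0, 0, 0]    # hypothetical index of each word

# Full one-hot representation: one 24-dim vector per word, flattened to 240 values.
one_hot = np.zeros((seq_len, vocab_size))
one_hot[np.arange(seq_len), word_indices] = 1
print(one_hot.reshape(-1).shape)                  # (240,) -> one row of the m x 240 matrix

# Compact representation: just the 10 indices, i.e. one row of the m x L matrix.
print(np.array(word_indices).shape)               # (10,)
```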
Let’s see how the computation happens; say “s0” is all zeroes
Now the first input (“x1”) would be passed to the model: the first element from this input row would be taken and multiplied with “U”, the previous state “s0” (multiplied with “W”) and the bias term ‘b’ would be added to it, and on top of this, non-linearity would be applied
After this, the second input from the same row (first row) comes into play
And this would be computed all the way up to “s10” and from “s10” the final output “y_hat” could be computed as
The entire picture looks like this:
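A minimal NumPy sketch of this forward computation, assuming the recurrence s_t = tanh(U·x_t + W·s_(t-1) + b) and y_hat = softmax(V·s_10 + c), with made-up dimensions and random weights:

```python
import numpy as np

vocab_size, hidden_size, num_classes, seq_len = 24, 16, 2, 10
rng = np.random.default_rng(0)

U = rng.normal(size=(hidden_size, vocab_size))    # input-to-state weights
W = rng.normal(size=(hidden_size, hidden_size))   # state-to-state weights
V = rng.normal(size=(num_classes, hidden_size))   # state-to-output weights
b, c = np.zeros(hidden_size), np.zeros(num_classes)

# One input row: 10 one-hot encoded words (random indices here, just for shape).
x = np.eye(vocab_size)[rng.integers(vocab_size, size=seq_len)]

s = np.zeros(hidden_size)                         # s0 is all zeroes
for t in range(seq_len):
    s = np.tanh(U @ x[t] + W @ s + b)             # s_t depends on x_t and s_(t-1)

logits = V @ s + c                                # output computed from the last state s10
y_hat = np.exp(logits) / np.exp(logits).sum()     # softmax over the two classes
print(y_hat)
```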
Padding
A question might arise that the padding could corrupt the input, as an artificial word is being added at the end of each input, for example:
Here the original sentence (of length 2) was “Great movie”, and the data processing would add “6 pad, 1 sos, and 1 eos” to it to make it of length 10
The computations for this would look like:
Ideally, the computation would have ended after computing “s4”, because that is all the original sentence contained, but here it computes s5, s6, and so on — in which case the “special character for pad” would be the input at every time step (after the 4th time step) — and then the final output is computed at the end.
This looks like it might corrupt the input
In practice, the idea would be to take the output right after the “eos” itself, and from there evaluate whether the sentence conveys positive sentiment or negative sentiment
Padding is incorporated to ensure the data matrix is consistent in terms of the number of features across all inputs
Another vector is passed as input to PyTorch/TensorFlow; this vector contains the true length of each sentence and is leveraged to do the computations only up to that point, for example:
- if the true length corresponding to an input is, say, 4, then the computations would be done only till that point, and softmax would be applied on top of the output of that respective time step (a short PyTorch sketch follows)
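In PyTorch, for example, this is roughly how the true lengths can be passed along with the padded batch using pack_padded_sequence — a minimal sketch, not the article's exact setup; the vocabulary size, indices, and layer sizes below are made up:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

vocab_size, embed_dim, hidden_size, num_classes = 24, 8, 16, 2

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, num_classes)

# Two padded sentences (index 0 = <pad>) and a vector with their true lengths.
x = torch.tensor([[1, 4, 5, 2, 0, 0],
                  [1, 6, 7, 8, 5, 2]])
lengths = torch.tensor([4, 6])

packed = pack_padded_sequence(embedding(x), lengths, batch_first=True,
                              enforce_sorted=False)
_, last_state = rnn(packed)                  # recurrence stops at each sentence's true length
logits = classifier(last_state.squeeze(0))   # classify from the last "real" state
print(torch.softmax(logits, dim=-1))         # one prediction per sentence
```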
References: PadhAI