Sequence Learning Problems

Parveen Khurana
Jul 6, 2019

In all the networks that we have covered so far (Fully Connected Neural Networks (FCNN), Convolutional Neural Networks (CNN)):

  • the output at any time step is independent of previous inputs/outputs
  • the input is always of a fixed length/size

Say the neural network below is used to link a patient’s biological parameters to health risk (more like a classifier saying whether the patient is at risk or not); then the model’s output for patient 2 would not be linked in any way to the model’s output for patient 1.

And all the patients/cases fed to this model would have the same number of parameters (height, weight, …, sugar, etc.).

Fully Connected Neural Network (say all inputs are numeric and are standardized before passing to the model)

Similarly, for a CNN (say image classification), whether the output for input 1 was apple/bus/car/&lt;any class&gt;, it would have no impact on the output for input 2.

Let’s say all input images are of size 30 × 30 (images of a different size can be rescaled to the required dimensions) so that all inputs have the same size.

Convolutional Neural Network

In general, we can say that for fully connected and convolutional neural networks, the output at time step “t” is independent of any of the previous inputs/outputs.

Another property of these architectures is that all the neurons in a layer are connected to all the neurons in the previous layer. For CNNs, we can think of the weights associated with most of the neurons from the previous layer as being 0 (sparse connectivity), so only a few neurons are effectively considered.

Two properties of FCNN and CNN:

  • Output at any time step is independent of previous inputs
  • Input is of a fixed length

Sequence Learning Problems

In sequence learning problems, these two properties of FCNNs and CNNs do not hold: the output at any time step depends on previous inputs/outputs, and the length of the input is not fixed.

Let’s consider the case of auto-completion. Say the user types the character ‘d’, and the model tries to predict the next character.

Consider this as a classification problem: the job is to identify the next character out of the 26 letters, given that the input is ‘d’. The output can be considered a distribution over those 26 characters (all the letters).

There would be some true distribution; in this case, all the probability mass is on ‘e’ (assuming it is the true character that follows ‘d’), and the probability mass is 0 for all other characters in the true distribution, i.e.

y = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

The predicted output y_hat is also going to be a distribution over these 26 characters; the output layer here would be a softmax layer.

The problem is that the input is ‘d’ (a string character), but neural networks take only numbers as input.

Let’s say one-hot encoding is used for all the characters, then

‘a’ would be [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

‘b’ would be [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

i.e. each character is represented by a 26-dimensional vector (the vector length is the total number of possible values the variable can take); the entry at the index corresponding to the character (with indexing starting at 0) is set to 1, and everything else is 0.

So, the input ‘d’ would be represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
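This one-hot encoding can be sketched in a few lines of Python (purely illustrative; index 0 corresponds to ‘a’):

```python
import string

def one_hot(char):
    """26-dimensional one-hot vector for a lowercase letter."""
    vec = [0] * 26
    vec[string.ascii_lowercase.index(char)] = 1
    return vec

print(one_hot('d'))  # the 1 lands at index 3
```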

In this case, the input and output have the same dimension, but that may not always be the case.

Say there is one hidden layer and then the final output layer

And take the dimension of the hidden layer to be 20 × 1, while the input and output dimensions are 26 × 1.

The hidden layer would be computed as h = g(Wx + b), where W is a (20 × 26) weight matrix and g is a non-linear activation.

Similarly, the bias b in the above equation is going to be a 20-dimensional vector, so (Wx + b) gives us a 20-dimensional vector; a non-linear function is applied on top of this to get the output at this intermediate layer as ‘h’ (this represents the value at the hidden layer, not to be confused with the letter ‘h’), which is going to be a 20-dimensional vector.

As ‘h’ is 20-dimensional and the output is supposed to be 26-dimensional (over the 26 letters), ‘V’ would be of size (26 × 20) and ‘c’ would be a 26-dimensional vector; (Vh + c) is computed first, and then softmax is applied on top of it, which gives the value of ‘y_hat’ as a probability distribution.
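A minimal NumPy sketch of this forward pass, with the dimensions from the text (the random weights and the choice of tanh as the non-linearity are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: input/output are 26-dim, hidden layer is 20-dim.
W = rng.standard_normal((20, 26))   # input-to-hidden weights
b = rng.standard_normal(20)         # hidden bias
V = rng.standard_normal((26, 20))   # hidden-to-output weights
c = rng.standard_normal(26)         # output bias

x = np.zeros(26)
x[3] = 1.0                          # one-hot input for 'd'

h = np.tanh(W @ x + b)              # hidden layer, 20-dim
logits = V @ h + c                  # pre-softmax scores, 26-dim
y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax distribution

print(y_hat.shape, y_hat.sum())     # a 26-dim vector that sums to 1
```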

The way auto-completion works is that it takes the top 3–4 entries from this ‘y_hat’ distribution and suggests those characters.
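Picking the top-3 suggestions from a predicted distribution can be sketched like this (the y_hat values here are made up purely for illustration):

```python
import numpy as np

# A made-up predicted distribution over the 26 letters.
y_hat = np.full(26, 0.02)
y_hat[[4, 2, 0]] = [0.30, 0.15, 0.11]   # boost 'e', 'c', 'a'
y_hat /= y_hat.sum()                     # renormalize to a distribution

top3 = np.argsort(y_hat)[::-1][:3]       # indices of the 3 largest entries
chars = [chr(ord('a') + i) for i in top3]
print(chars)  # ['e', 'c', 'a']
```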

Let’s say ‘e’ was among these top 3, and the user selected ‘e’ as the second character (after ‘d’).

Now this ‘e’ acts as the input, and the model again tries to predict an output. The interesting part is that the output now depends not only on the current input (‘e’) but also on the input at the previous time step:

For example, if the previous input had been ‘z’ instead of ‘d’ at the first time step (with ‘e’ as the output of the first time step), then there is a very high probability that the output at the second time step is going to be ‘b’, to complete the word ‘zebra’.

Whereas if the input at the first time step is ‘d’ followed by ‘e’ (as the output of the first time step), then it is very likely that the next output would be ‘c’ (deceive) or ‘e’ (deep/deer) and not ‘b’.

We have a sequence of inputs, and the output depends on all previous inputs, or at least on some of them. For example, in sentence completion, the 100th word might not depend on all the previous 99 words, but it would certainly depend on the last 3–4 words.

The length of the input is also not fixed: in the example above there are 4 characters as input, but the number of characters changes depending on whether the word is long or short. That is what is meant by the length of the input.

And instead of having the distribution over 26 characters, we could have it over (26 + 1) characters, where the additional character (&lt;eos&gt;) tells us that the sequence has ended.
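Extending the alphabet with an end-of-sequence token could look like this (a minimal sketch; the token name &lt;eos&gt; follows the text):

```python
import string

vocab = list(string.ascii_lowercase) + ['<eos>']  # 26 letters + end marker
print(len(vocab))  # 27

def one_hot(token):
    """One-hot vector over the 27-symbol vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

# Probability mass on <eos> means "the word ends here".
print(one_hot('<eos>').index(1))  # 26, the last position
```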

So, the output in sequence learning problems depends on previous inputs as well, and the length of the input is certainly not fixed (and in a sense, the model gives a sequence of outputs: one output at time step 1, another at time step 2, and so on).

Some more examples of Sequence Learning Problems

Part of speech tagging

Given a sequence of words, the idea is to predict the part-of-speech tag for each word (whether that word is a pronoun, noun, verb, article, adjective, and so on).

Here also, the output depends not only on the current input but also on the previous input(s). For example:

Say the current input is ‘movie’ and the previous input was ‘awesome’, which is an adjective; the moment the model sees an adjective, it can be fairly confident that the next word is actually going to be a noun.

So, the confidence in predicting that ‘movie’ is a noun would be higher if the previous input was an adjective.

There is this dependency where the current output depends not only on the current input but on the previous input as well

Let’s say the input is the word ‘bank’ in a sentence: this could be a verb (I can bank on him) or a noun (I had gone to a bank).

In the noun case, the previous word is an article, after which it is very unlikely that the following word would be a verb; it is going to be a noun.

So, even in this kind of ambiguous case, the previous sequence of words (the context) helps us predict the output/make this decision.

And here as well, one-hot encoding could be used to represent words as numbers: all the words could be listed (say there are 10,000 words), with an index assigned to each word (which remains unchanged in subsequent iterations). For an input word, its index is looked up; let’s say the index is 5 (assuming indexing starts at 1), then its corresponding one-hot vector has the fifth entry set to 1 and everything else 0.

Again, from this one-hot encoded input, the pre-activation value (Wx + b) is computed, a non-linearity is applied on top of it, the result is passed on to the subsequent layers, and finally through the softmax layer, which gives the probability distribution.
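The same pipeline for words can be sketched as follows (the tiny vocabulary, the hidden size of 8, the 4-tag output, and the random weights are all made-up values for illustration):

```python
import numpy as np

# Hypothetical tiny vocabulary (in practice ~10,000 words with fixed indices).
vocab = {'the': 0, 'movie': 1, 'was': 2, 'awesome': 3, 'bank': 4}

def word_one_hot(word):
    """One-hot vector over the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

rng = np.random.default_rng(1)
n_tags = 4                                   # e.g. noun, verb, adjective, article
W = rng.standard_normal((8, len(vocab)))     # assumed hidden layer of size 8
b = rng.standard_normal(8)
V = rng.standard_normal((n_tags, 8))
c = rng.standard_normal(n_tags)

x = word_one_hot('movie')
h = np.tanh(W @ x + b)                       # pre-activation + non-linearity
scores = V @ h + c
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the 4 tags
print(probs.sum())  # sums to 1
```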

Here also, it is clear that the output depends on previous inputs and that the input could be of arbitrary length (a sentence could have any number of words).

Sentiment Analysis:

It’s not mandatory to produce an output at each step (time step); for example, in sentiment analysis, the model looks at all the words in a sentence and predicts a single final output: the positive/negative sentiment conveyed by the sentence.

This can be thought of as producing an output at every time step but ignoring all of them except the final one (which is still dependent on all the previous inputs). This is also termed a Sequence Classification problem.

Sequence Learning problems using video and speech data:

Speech Recognition — Think of speech as a sequence of phonemes; given the speech signal as input, the idea would be to map each part of the signal to its respective phoneme in the language.

Another task could be to look at the entire speech sequence and predict the tone/emotion of the speaker (say, whether the person is speaking angrily, or is happy or sad, etc.).

Video Labeling — A video is a sequence of frames (there might be some pre-processing on these frames); one task could be to label every frame in the video (say, which of the 12 steps of Surya namaskar a frame corresponds to):

The other task could be to look at all the frames and give the label for the entire video sequence:

Here again, the input is not fixed in length (the number of frames differs from video to video), and the output at the final time step depends on all previous inputs (for example, the model might label a video sequence as “Surya namaskar” only at the final frame, but that label inherently depends on all the previous frames).

In this article, we briefly touched upon the properties of sequence-based tasks; in the next article, we will discuss how to model them.

References: PadhAI