Parveen Khurana
10 min read · Jan 31, 2022

How to model sequence learning problems?

In the previous article, we looked at the basics and properties of sequence learning problems, and some examples of sequence learning problems in NLP, video, and speech data. In this article, let's dive into the details of the modeling aspect of sequence-based problems.

Let’s consider the task of character prediction (in which output is predicted at every time step)

The basic model of interest takes in a number of inputs and predicts an output at every time step, in such a manner that the output at any specific time step depends on the current input as well as all the previous inputs.

Let's call the output at time step 1 "y₁", the output at time step 2 "y₂", and so on. In general, the output at the t'th time step is represented as "yₜ".

The true relationship is of such a form that yₜ depends on all the previous inputs:

True relation

The output may not depend heavily on every previous input, but it depends on at least some of the previous inputs in addition to the input at the current time step.

The function/model needs to approximate the relation between the input and the output (producing the prediction y_hat) and ensure that this function somehow depends on the previous inputs as well.

Approximation of the true relationship

At each time step, the function needs to do the same processing:

  • Take in the current input and the history of inputs, and then predict the output
  • The output is also from the same fixed set: one of the 26 characters of the alphabet in this case, or say one of the 10 part-of-speech tags for POS tag prediction, or one of the yoga poses for the video sequence-based problem (say, Surya Namaskar)

So, the task is not changing from one time step to the next (the same function is to be executed); the only thing changing is the input to the function.

The function should also be able to work with a variable number of inputs: one word may have 4 characters while another has 10, and the same function should be able to take in either.

Overall, the need is for a function that can act as a good approximation of the relation between the input and the output while satisfying these 3 requirements:

  • the output at any time step should depend on the previous inputs as well as the current input
  • the same function should be executed at every time step
  • the function should be able to deal with a variable number of inputs

For the cases wherein the idea is to predict just one output (say, a video label) given all the inputs, the same conditions should hold, i.e.

  • the function should be able to deal with a variable number of inputs
  • the output should depend on all the previous inputs

In this case, the other point (the same function being executed at every time step) does not hold, because we are not making a prediction at every time step.

Let's start with the property that "the function being executed at each time step is the same". The rationale is that at every time step the function is doing the same job: it takes an input (and a history of inputs) and predicts the output (say, the part-of-speech tag).

Let the input at the i'th time step be xᵢ.

The input is multiplied with a weight matrix, the bias term is added to it, and then a non-linearity is applied on top of the result of this summation, which gives hᵢ for the i'th input.

Then the same set of arithmetic operations is repeated: hᵢ is multiplied with another weight matrix, a bias term is added to it, and a non-linear transformation is applied on top of the result of this summation to get the output (dimensions are chosen so that the output is a probability distribution over all the possible classes as per the problem statement).

The network parameters, i.e. W₁, W₂, b₁, b₂, remain the same at every time step and hence the function executed is also the same. This is the concept of "parameter sharing" (same parameters at all the time steps).
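To make this concrete, below is a minimal numpy sketch of the per-time-step computation with shared parameters. The sizes n, d, k, the tanh non-linearity, and the softmax output are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# illustrative sizes (assumptions): input dim n, hidden dim d, k output classes
n, d, k = 8, 16, 26
rng = np.random.default_rng(0)

W1, b1 = rng.normal(size=(d, n)), np.zeros(d)   # shared across all time steps
W2, b2 = rng.normal(size=(k, d)), np.zeros(k)   # shared across all time steps

def step(x_t):
    """The same function is executed at every time step (parameter sharing)."""
    h_t = np.tanh(W1 @ x_t + b1)                # hidden representation for this input
    scores = W2 @ h_t + b2
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                  # probability distribution over k classes

sequence = [rng.normal(size=n) for _ in range(4)]   # e.g. a 4-character word
outputs = [step(x_t) for x_t in sequence]           # one prediction per time step
```

Note that in this form each step sees only its own input; how to make the output depend on the history as well is exactly what the rest of the article builds toward.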

In Recurrent Neural Networks (RNNs), the nomenclature is a little different compared to a Fully Connected Neural Network (FCNN):

  • W₁, W₂ are represented by U, V
  • b₁, b₂ are represented by b, c
  • hᵢ is represented by sᵢ
  • s is termed the state

Now, this representation of the function/network inherently takes care of another property as well, i.e. the above-defined function can deal with a variable number of inputs.

To compute, say, y₉₉, the equation would be:
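The equation image is not reproduced here; reconstructed from the notation above (with σ as the hidden non-linearity and O as the output function, say softmax, both of which are assumptions), it would read roughly:

```
s_{99} = \sigma(U x_{99} + b), \qquad y_{99} = O(V s_{99} + c)
```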

The defined function is independent of the time step: even with a variable number of inputs, the idea is simply to plug in the current time step's input and get the output. The downside is that the first property, that yₜ should depend on the previous inputs as well, is not ensured:

In the current representation, the output at any time step depends only on the input for that time step (in addition to the weights/parameters and the bias terms).

To satisfy the first property, the function should be something like:
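The image of the desired functional form is missing; based on the surrounding text, the requirement is that the output at time step t be a function of all the inputs up to t, roughly:

```
y_t = f(x_1, x_2, \ldots, x_t)
```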

At the first time step, the output depends only on the first input:

Then, at the second time step, the first two inputs could be concatenated, and the concatenated vector passed through the network to compute the output.

Here x₁ and x₂ are concatenated to form x, which is multiplied with the weight/parameter matrix U to compute s₂, which in turn is passed through a non-linear function that yields the value of y₂.
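The equation images for these first two steps are not shown; a rough reconstruction in the notation above (with [x₁; x₂] denoting concatenation, and σ, O as assumed before) would be:

```
y_1 = O(V \sigma(U x_1 + b) + c)
y_2 = O(V \sigma(U [x_1; x_2] + b) + c)
```

where, as discussed below, the U in the second equation has to be wider than the one in the first.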

Now the situation is that:

  • y₁ is a function of x₁
  • y₂ is a function of x₁ and x₂

For y₃, x₁, x₂, and x₃ are concatenated to form x, which is passed through the network to compute s₃, and using s₃ the value of y₃ can be computed (by passing it through a non-linear function).

So here:

  • y₃ would be a function of x₁, x₂, x₃

If this approach is taken, then the function executed at every time step is not the same (the dimension of the weight matrix would keep changing with each different set of concatenated inputs) and, as a result, there is no parameter sharing:

At the first time step, the input is x₁, which would be a vector of dimension (n X 1); the hidden representation (say d-dimensional) is to be computed using x₁, and the weight matrix U is expected to convert the n-dimensional input to a d-dimensional output, so the dimension of U would be (d X n).

Now, at the second time step, x₁ and x₂ are concatenated to form x; the dimension of x here would be (2n X 1) and the expected dimension of U would be (d X 2n) to convert the 2n-dimensional input to a d-dimensional output.

The weight matrix U in the second time step is going to be different from the U used in the first time step, which indicates that the parameters are not being shared, which in turn implies that the function being executed at every step is going to be different.
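A small numpy illustration of this dimension argument (the shapes are illustrative and the bias terms are omitted for brevity):

```python
import numpy as np

n, d = 8, 16                          # illustrative sizes
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=n), rng.normal(size=n)

U_step1 = rng.normal(size=(d, n))     # step 1: maps an n-dim input to a d-dim state
s1 = np.tanh(U_step1 @ x1)

x_cat = np.concatenate([x1, x2])      # step 2: the concatenated input is 2n-dimensional
U_step2 = rng.normal(size=(d, 2 * n)) # a different matrix of shape (d x 2n) is required
s2 = np.tanh(U_step2 @ x_cat)

# U_step1 and U_step2 have different shapes, so the parameters cannot be shared
# and the function executed at step 2 is not the same as the one at step 1
print(U_step1.shape, U_step2.shape)   # (16, 8) (16, 16)
```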

If addition of the inputs is carried out instead of concatenation, the dimension would remain the same, but there is a catch: at every time step, the semantics of the input changes:

At the first time step the input is just x₁; at the second time step it is x₁ + x₂, so the semantics changes; then at the next time step it is x₁ + x₂ + x₃, so again the meaning of the input changes. It is not the same as feeding in a single input; it is an artificial addition, which is not the right thing to do. Even if we write this as a function of all the inputs, these formulations still do not serve the complete purpose:

  • At every time step, the number of parameters would increase
  • The function executed at every time step in itself is different
  • And this functional representation cannot deal with arbitrary input sizes: say at training time all the data was of length up to 20 only, but at test time the model gets a sentence of length 21; it now needs a new "U" parameter of dimension (d X 21n) that was never trained, which would lead to poor results

At this point, two representations have been discussed, but neither satisfies the 3 requirements of the model/network needed to address a sequence-based problem. This sets the context for the Recurrent Neural Network, which inherently satisfies all the required properties.

Recurrent Neural Networks (RNNs)

In the first functional representation discussed above, each time step was more or less acting as an independent unit, with the same function being executed at each time step. Now, to make sure that all the ideal properties are satisfied, the model's equations are re-defined as below:
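The equation images are not reproduced here; in the notation introduced earlier (U, V, W for the weights, b, c for the biases, σ for the hidden non-linearity, and O for the output function), the redefined model reads roughly:

```
s_t = \sigma(U x_t + W s_{t-1} + b)
y_t = O(V s_t + c)
```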

There are recurrent connections between the intermediate outputs/hidden states (connected through the parameter W).

Now s₂ (the intermediate output at time step 2) would depend on x₂ and s₁ (the intermediate output at time step 1), and s₁ in turn depends on x₁, so ultimately the output at the 2nd time step inherently depends on the input at the first time step:

Let's discuss the dimensions of the different parameters used in these equations:

  • Input xᵢ is taken to be an n-dimensional vector (n X 1)
  • The weight matrix U would be of dimension (d X n) to convert the n-dimensional input to a d-dimensional (intermediate) output
  • sᵢ and sᵢ₋₁ would be d-dimensional vectors as per the design of the problem (the intermediate output is assumed to be a d-dimensional vector) (d X 1)
  • And since sᵢ₋₁ is a d-dimensional vector, W needs to be a (d X d) matrix so that the result of this multiplication is again a d-dimensional vector (d X 1)
  • And the bias term b should also be a d-dimensional vector
  • And then the weight matrix at the second step, V, should be of dimension (k X d) (assuming there are k output classes) so that the multiplication of V and sᵢ gives a k-dimensional vector

So, with this set of equations, it is clear that the same function is going to be executed at all time steps: all the parameters remain the same, only the input changes.
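Putting the pieces together, here is a minimal numpy sketch of this forward pass; the tanh non-linearity, the softmax output, and the concrete sizes are assumptions for illustration:

```python
import numpy as np

n, d, k = 8, 16, 26                 # input dim, state dim, number of output classes
rng = np.random.default_rng(0)

U = rng.normal(size=(d, n))         # input  -> state   (d x n)
W = rng.normal(size=(d, d))         # state  -> state   (d x d)
V = rng.normal(size=(k, d))         # state  -> output  (k x d)
b, c = np.zeros(d), np.zeros(k)

def rnn_forward(xs):
    """The same cell runs at every step; the state carries the history forward."""
    s = np.zeros(d)                              # initial state s0
    outputs = []
    for x_t in xs:                               # works for any sequence length
        s = np.tanh(U @ x_t + W @ s + b)         # s_t depends on x_t and s_{t-1}
        scores = V @ s + c
        probs = np.exp(scores - scores.max())
        outputs.append(probs / probs.sum())      # distribution over the k classes
    return outputs

# the same parameters U, W, V, b, c are reused whether the sequence has 5 or 50 steps
ys = rnn_forward([rng.normal(size=n) for _ in range(5)])
```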

The second property, i.e. that the current output must depend on the current input as well as the previous inputs, is also satisfied:

Output at time step 2 depends on the intermediate output at time step 2 which in turn depends on two inputs:

  • input at time step 2 and the intermediate output at step 1
  • Now the intermediate output at step 1 in turn depends on input at time step 1
  • So, inherently, the output at any time step would depend on all the previous inputs, as the unrolled expression below makes explicit:
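The referenced image is not reproduced; unrolling the recurrence for two steps (with s₀ the initial state) gives:

```
y_2 = O(V s_2 + c) = O\big(V \sigma(U x_2 + W \sigma(U x_1 + W s_0 + b) + b) + c\big)
```

so y₂ is a function of x₂ and, through s₁, of x₁ as well.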

And this function can deal with a variable number of inputs as well.

So, this is what a Recurrent Neural Network looks like: the recurrent connections between the hidden states ensure that all the requirements for addressing sequence-based problems are met.

References: PadhAI