Data and Tasks jar for Sequence Labeling: Recurrent Neural Networks (RNNs)

Parveen Khurana
Feb 6, 2022


In the last article, we discussed the Data and Task jar for sequence classification problems. In this article, we touch upon the Data and Task jar for sequence labeling problems.

Data and Tasks for Sequence Labeling

Let’s first discuss the objective of sequence labeling: for every word in the input sentence, the model predicts an output.

Say the input consists of a number of sentences. Post tokenization, the tabular representation would have one row per sentence, with each word of the sentence in its own column.

And for each word of each sentence/row, there is a corresponding output.

For example, in the first sentence, the first word “The” is a determiner, the second word “first” is an adjective, the third word “half” is a noun, and so on.

DT is for determiner, AJ for adjective, NN for noun, VB for verb, PC for punctuation, and so on.

For every word in the input, there is a respective true output as well: essentially a 1:1 mapping, in the sense that each input word has some output. And since the input sentence can have a variable number of words, and there is this 1:1 mapping between input and output, the output is also of variable length.
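A minimal sketch of how such word/tag data could be held in Python. Only the first three words and their tags come from the example above; the rest of the first sentence, the whole second sentence, and the variable names are made up for illustration.

```python
# Hypothetical toy data: one list of words and one list of tags per sentence.
sentences = [
    ["The", "first", "half", "was", "slow", "."],
    ["The", "match", "ended", "."],
]
tags = [
    ["DT", "AJ", "NN", "VB", "AJ", "PC"],
    ["DT", "NN", "VB", "PC"],
]

# 1:1 mapping: every word has exactly one label, so lengths match row by row.
for words, labels in zip(sentences, tags):
    assert len(words) == len(labels)

# Sentences differ in length, so the outputs differ in length too.
lengths = [len(words) for words in sentences]
print(lengths)  # [6, 4]
```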

And the input and output need to be converted to numbers, as the model takes in numeric inputs.

Let’s look at the operations to be covered under data pre-processing:

  • Special symbols are to be defined to denote “start of sequence” and “end of sequence”, and for the “padding operation”. The details of these symbols and the rationale for them are discussed in this article

Here is what the input and the output would look like after incorporating the special symbols.

One thing to notice is that if the input token is a special symbol (say “start of sequence”, “end of sequence” or “pad”), the respective output also reflects the same symbol.

Say the max. sentence length is 10 across all input sentences, and there are “m” data rows in the input. Then the input data matrix would have dimension “m x 10”, and each of the 10 entries in a row would store the index at which the one-hot encoded vector of that word has the value 1.

And since there is a 1:1 mapping between the input and the output, the dimension of the output matrix is also going to be “m x 10” (here the 10 entries are the indices of the labels in the sentence). Here as well, a framework like PyTorch or TensorFlow might just store the index where the value is 1 instead of storing the actual one-hot encoded vector.
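The special symbols and the padding step could be sketched as follows. The symbol strings `<sos>`, `<eos>`, `<pad>` and the helper `add_special_symbols` are assumptions chosen for illustration, and the toy sentences are hypothetical apart from the first three words.

```python
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"  # assumed names for the special symbols

def add_special_symbols(seq, max_len):
    """Wrap a sequence in start/end markers, then pad on the right up to max_len."""
    wrapped = [SOS] + list(seq) + [EOS]
    if len(wrapped) > max_len:
        raise ValueError("max_len is too small for this sequence")
    return wrapped + [PAD] * (max_len - len(wrapped))

sentences = [
    ["The", "first", "half", "was", "slow", "."],
    ["The", "match", "ended", "."],
]
tags = [
    ["DT", "AJ", "NN", "VB", "AJ", "PC"],
    ["DT", "NN", "VB", "PC"],
]

# Longest sentence plus two slots for the start and end markers.
max_len = max(len(s) for s in sentences) + 2

# The same symbols are applied to the outputs, so the 1:1 mapping is preserved.
X = [add_special_symbols(s, max_len) for s in sentences]
Y = [add_special_symbols(t, max_len) for t in tags]

for row_x, row_y in zip(X, Y):
    print(row_x)
    print(row_y)
```

Both `X` and `Y` end up with m rows of the same length (here m = 2 and the length is 8), which is the “m x max length” shape described above, just with tokens still as strings rather than indices.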

  • Data preparation for conversion to the one-hot encoded form

After the incorporation of special symbols in each sentence, the idea is to prepare a running list of all unique words in the training data and to assign a unique index to each word. This table is then leveraged to prepare a one-hot encoded vector for each word.

And since this is, in a sense, a multi-class classification problem at every time step, all possible output values are tabulated too, and each is assigned an index that is used to represent its one-hot encoded vector.
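A small sketch of this indexing step, assuming the rows already contain the special symbols. The helper names `build_index` and `one_hot`, the symbol strings, and the toy rows are hypothetical.

```python
def build_index(rows):
    """Assign each unique item an index, in order of first appearance."""
    index = {}
    for row in rows:
        for item in row:
            if item not in index:
                index[item] = len(index)
    return index

def one_hot(i, size):
    """Vector of zeros with a single 1 at position i."""
    vec = [0] * size
    vec[i] = 1
    return vec

X = [
    ["<sos>", "The", "first", "half", "<eos>"],
    ["<sos>", "The", "match", "<eos>", "<pad>"],
]
Y = [
    ["<sos>", "DT", "AJ", "NN", "<eos>"],
    ["<sos>", "DT", "NN", "<eos>", "<pad>"],
]

word_to_idx = build_index(X)  # running table of unique input words
tag_to_idx = build_index(Y)   # running table of all possible outputs

# Store only the index of the 1 in each one-hot vector.
X_idx = [[word_to_idx[w] for w in row] for row in X]
Y_idx = [[tag_to_idx[t] for t in row] for row in Y]

print(X_idx)  # [[0, 1, 2, 3, 4], [0, 1, 5, 4, 6]]
print(Y_idx)  # [[0, 1, 2, 3, 4], [0, 1, 3, 4, 5]]
print(one_hot(word_to_idx["first"], len(word_to_idx)))  # [0, 0, 1, 0, 0, 0, 0]
```

Storing the index matrices rather than the full one-hot vectors keeps the data compact; the one-hot vector can always be rebuilt from the index when needed.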

Computation process:

The first word of the first input (“x1”) is passed to the model and is used to compute the state “s1”, and from this “y1_hat” is computed using the softmax function. Since this is a supervised learning problem, the true distribution “y1” is already available, so the true and predicted outputs can be compared to compute the loss value.

Here as well, “s0” is taken as a vector with all entries as 0

In the next iteration, the second word of the same input sentence is passed to the model, and all the computations are done in a similar fashion (now using “s1” as the previous state), and so on for the subsequent time steps.
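The step-by-step computation can be sketched in plain Python. The article does not spell out the update equations, so this assumes the common vanilla RNN form s_t = tanh(U x_t + W s_{t-1} + b) and y_t_hat = softmax(V s_t + c), with cross-entropy as the loss; the parameter names, sizes, and random weights are illustrative only.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    shift = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(x_idx, U, W, V, b, c):
    """Return one predicted tag distribution per time step."""
    state = [0.0] * len(b)  # s0: a vector with all entries 0
    outputs = []
    for i in x_idx:
        # U times a one-hot vector is simply column i of U.
        u_col = [row[i] for row in U]
        w_s = matvec(W, state)
        state = [math.tanh(u + w + bias) for u, w, bias in zip(u_col, w_s, b)]
        logits = [v + bias for v, bias in zip(matvec(V, state), c)]
        outputs.append(softmax(logits))
    return outputs

def cross_entropy(y_hat, true_idx):
    """Loss at one time step when the true output is one-hot at true_idx."""
    return -math.log(y_hat[true_idx])

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
vocab_size, hidden_size, num_tags = 7, 4, 6  # assumed toy sizes
U = rand_matrix(hidden_size, vocab_size)
W = rand_matrix(hidden_size, hidden_size)
V = rand_matrix(num_tags, hidden_size)
b = [0.0] * hidden_size
c = [0.0] * num_tags

x_idx = [0, 1, 2, 3, 4]  # word indices of one sentence
y_idx = [0, 1, 2, 3, 4]  # true tag indices, one per word

y_hats = rnn_forward(x_idx, U, W, V, b, c)
losses = [cross_entropy(y_hat, t) for y_hat, t in zip(y_hats, y_idx)]
print(len(y_hats), sum(losses))
```

Unlike sequence classification, a prediction and a loss term are produced at every time step, and the per-step losses are then combined (summed here) into the loss for the sentence.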

Here also, the incorporation of padding does not corrupt the input. Padding is added only to make sure the dimensions are consistent across all input sentences, and the actual computations happen only up to the true length (again, a vector with the true lengths of all the inputs is passed as input to the model).
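One way to picture “computations only up to the true length” is a loss that simply ignores the padded steps. This is a hedged sketch, not how any particular framework does it internally: `masked_sequence_loss` is a hypothetical helper, the predicted distributions are made-up numbers, and averaging over the real steps is one possible choice.

```python
import math

def masked_sequence_loss(y_hats, y_idx, true_length):
    """Average cross-entropy over the first true_length steps only."""
    total = 0.0
    for t in range(true_length):
        total += -math.log(y_hats[t][y_idx[t]])
    return total / true_length

# Hypothetical predictions for one row: 3 real steps followed by 2 padded steps.
y_hats = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],  # padded step, ignored
    [0.3, 0.3, 0.4],    # padded step, ignored
]
y_idx = [0, 1, 2, 0, 0]

loss = masked_sequence_loss(y_hats, y_idx, true_length=3)
print(loss)
```

Whatever the model predicts at the padded positions, the loss stays the same, which is why padding does not distort training.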

References: PadhAI