Attention:

Sequence to Sequence Model:

An input sequence is provided, and an output sequence is derived from that input, for example translating a sentence from one language to another.

Encoder and Decoder:

The encoder encodes the input we provide into what is called a context vector, which is passed to the decoder after encoding; the decoder then decodes it to produce the output sequence.

Now, we could always use a big context, i.e., the outputs from all the hidden states, but then we have performance issues and a higher chance of overfitting. So, we usually end up using only the last hidden state output of the encoder LSTM cell as the context vector, which squeezes the entire input into a single fixed-size vector. This bottleneck is what Attention solves.

Before we start feeding our words to the encoder, they are embedded.
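
As a minimal sketch of that embedding step, here is what the lookup might look like in Python with NumPy (the vocabulary, shapes, and values are all made up for illustration):

import numpy as np

# Hypothetical vocabulary mapping words to integer ids
vocab = {"je": 0, "suis": 1, "etudiant": 2, "<END>": 3}

embedding_dim = 4
# Embedding matrix: one row per word, learned during training (random here)
embedding = np.random.randn(len(vocab), embedding_dim)

sentence = ["je", "suis", "etudiant"]
# Look up the embedding row for each word before feeding the encoder
embedded_inputs = np.stack([embedding[vocab[w]] for w in sentence])
print(embedded_inputs.shape)  # (3, 4): one vector per input word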

Encoder: An RNN.

  • Takes an embedded input.
  • After encoding, the output of each hidden state is scored with a scoring function.
  • These scores are then fed to a softmax function, which turns them into positive weights between 0 and 1 that sum to 1.
  • Each hidden state vector is then multiplied by its softmax weight, and the weighted vectors are summed up. This sum becomes the context vector for the decoder (see the sketch after this list).
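
The steps above boil down to a few lines. Here is a minimal NumPy sketch, assuming a toy setup where h_enc holds one hidden state per input word and h_dec is the current decoder hidden state:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_size = 4
h_enc = np.random.randn(3, hidden_size)  # 3 encoder hidden states
h_dec = np.random.randn(hidden_size)     # current decoder hidden state

scores = h_enc @ h_dec          # one score per encoder hidden state
weights = softmax(scores)       # positive, between 0 and 1, sum to 1
context = weights @ h_enc       # weighted sum = attention context vector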

Decoder: An RNN.

Attention Decoder: The attention decoder initially takes two inputs: the embedded <END> token and the first context vector provided by the encoder. These inputs are passed through the decoder RNN, which produces a hidden state and an output. The next RNN step is then fed three things: the new hidden state from the previous step, the output of the previous step, and the context vector computed for that step. This step again produces a hidden state and an output, and the computation continues in the same way for every following step.
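
The data flow of these decoder steps can be sketched as a loop. Everything below is a stand-in: rnn_step is a hypothetical dummy in place of a real LSTM/GRU cell, and attention_context abbreviates the scoring process described in the next section:

import numpy as np

hidden_size = 4

def rnn_step(x, h):
    # Dummy RNN cell: a real decoder would use an LSTM/GRU here
    h_new = np.tanh(x[:hidden_size] + h)
    return h_new, h_new  # (new hidden state, output)

def attention_context(h_dec, h_enc):
    # Placeholder for the scoring + softmax + weighted sum step
    scores = h_enc @ h_dec
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ h_enc

h_enc = np.random.randn(3, hidden_size)   # encoder hidden states
h = np.zeros(hidden_size)                 # initial decoder hidden state
x = np.random.randn(hidden_size)          # embedded <END> token
context = h_enc[-1]                       # first context from the encoder

for _ in range(3):  # unrolled decoder steps
    step_input = np.concatenate([x, context])
    h, out = rnn_step(step_input, h)       # new hidden state and output
    context = attention_context(h, h_enc)  # context for the next step
    x = out                                # output feeds the next step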

Now let me elaborate on the scoring of the encoded hidden states in the attention decoder. When these encoded hidden states are passed to the decoder, they are first scored and then multiplied by those scores. The scoring can be done in two ways, commonly called Multiplicative Attention and Additive Attention.

Multiplicative Attention:

After receiving the encoded hidden states from the encoder, the decoder scores them. The scores are calculated by a few formulas:

  • The first step is to feed the embedded <END> token from the encoder into the decoder along with a dummy initial hidden state. Only the resulting hidden state is kept for further calculation; the rest of the output is discarded.
  • Initially, the score is calculated as the dot product of the hidden state of the previous RNN step in the decoder and each of the encoder hidden states. As this is an attention decoder, all the encoder hidden states are considered here. So for each encoder hidden state h_enc and the decoder hidden state h_dec, the score is h_dec · h_enc. Sometimes a weight matrix is introduced between them, giving the second scoring function h_dec · (W_a · h_enc); this is useful in machine translation, where the two languages may have embedding spaces of different shapes.
  • Then, a softmax over the scores is calculated, and the softmax weights are multiplied with the encoder hidden states and summed to get the attention context vector.
  • The attention context vector is concatenated with the new hidden state generated by the RNN and passed through a fully connected layer with a tanh activation to get the final output. This output becomes the input for the next RNN step. The next hidden state is then used for the next attention context vector calculation, and the whole process explained above is repeated (see the sketch after this list).
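
Here is a minimal NumPy sketch of the two multiplicative scoring functions and the tanh output layer described above; the shapes are made up, and W_a and W_c are random stand-ins for matrices that would be learned during training:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

enc_size, dec_size = 4, 4
h_enc = np.random.randn(3, enc_size)   # encoder hidden states
h_dec = np.random.randn(dec_size)      # current decoder hidden state

# Scoring function 1 (dot): works when both spaces have the same shape
scores_dot = h_enc @ h_dec

# Scoring function 2 (with weight matrix): W_a bridges the two spaces
W_a = np.random.randn(dec_size, enc_size)
scores_general = h_enc @ (W_a.T @ h_dec)

weights = softmax(scores_general)
context = weights @ h_enc              # attention context vector

# Concatenate context with the decoder hidden state, then FC layer + tanh
W_c = np.random.randn(dec_size, enc_size + dec_size)
output = np.tanh(W_c @ np.concatenate([context, h_dec]))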

Additive Attention:

There is not much extra machinery here; it does exactly what the formula says. Instead of a dot product, the decoder hidden state and an encoder hidden state are concatenated, multiplied by a weight matrix W_a, passed through a tanh, and then multiplied by a learned vector v_a to produce the score: v_a · tanh(W_a · [h_dec; h_enc]). The softmax and weighted sum steps are the same as in multiplicative attention.
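
For completeness, a minimal NumPy sketch of the additive score, again with made-up shapes; v_a and W_a stand in for parameters that would be learned during training:

import numpy as np

hidden_size, attn_size = 4, 8
h_enc = np.random.randn(3, hidden_size)  # encoder hidden states
h_dec = np.random.randn(hidden_size)     # current decoder hidden state

W_a = np.random.randn(attn_size, 2 * hidden_size)  # learned in practice
v_a = np.random.randn(attn_size)                   # learned in practice

# Additive score: v_a . tanh(W_a [h_dec; h_enc]), one per encoder state
scores = np.array(
    [v_a @ np.tanh(W_a @ np.concatenate([h_dec, h])) for h in h_enc]
)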

The process is further optimized with the help of Transformers. I will explain more about them in the next post.