A self-attention layer is a component of transformer neural networks that allows the model to selectively focus on different parts of the input sequence when making predictions.
Advantages
- Allows the model to better understand the relationships between different elements of the input sequence
How to calculate
- The self-attention mechanism computes attention scores between each pair of input elements.
- It then takes a weighted sum of the values of all input elements, where the weights are given by the attention scores.
- The attention scores are calculated based on the similarity between the embeddings of the input elements.
The self-attention layer can be mathematically represented as follows:
$$ \text{SelfAttention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$
- $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, which are derived from the input sequence.
- $d_k$ is the dimension of the key vectors, and $\text{softmax}$ is the softmax function, which converts the scaled dot products into the attention scores.
The self-attention layer takes as input the query, key, and value matrices, and outputs, for each position, a weighted sum of the value vectors, where the weights are given by the attention scores.
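To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the function names and shapes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v).
    d_k = K.shape[-1]
    # Scaled dot products between every query and every key.
    scaled = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len)
    # Attention scores: softmax so each row sums to 1.
    A = softmax(scaled, axis=-1)
    # Weighted sum of the value vectors.
    return A @ V                      # (seq_len, d_v)
```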
Examples
The self-attention layer is a key component of transformer neural networks, which have achieved state-of-the-art results in a variety of natural language processing tasks, such as machine translation and language understanding.
Suppose we want to build a machine learning model that can classify movie reviews as positive or negative.
We can represent each review as a sequence of word embeddings, where each word is embedded in a high-dimensional space.
The self-attention layer can then be used to selectively focus on different parts of the review when making predictions.
- First, we need to derive the query, key, and value matrices from the input sequence (the full computation is sketched in code after these steps).
- We can do this by applying linear transformations to the input embeddings:
$$ Q = XW_Q, \quad K = XW_K, \quad V = XW_V $$
where $X$ is the input sequence of embeddings, and $W_Q$, $W_K$, and $W_V$ are learnable weight matrices.
- Next, we can compute the attention scores between each pair of input elements, using the dot product between the query and key matrices:
$$ A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) $$
where $d_k$ is the dimension of the key vectors.
- Finally, we can compute the weighted sum of the value matrix, using the attention scores as weights:
$$ \text{SelfAttention}(Q,K,V) = AV $$
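Putting the three steps together, the following sketch runs the computation end to end on a toy review. The dimensions and the randomly initialized weight matrices stand in for learned parameters and are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy review: 6 tokens, each embedded in d_model = 8 dimensions.
# In practice X would come from a trained embedding layer.
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))

# Learnable projection matrices W_Q, W_K, W_V (random here for illustration).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Step 1: derive queries, keys, and values from the input embeddings.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: attention scores between every pair of tokens.
scaled = Q @ K.T / np.sqrt(d_k)
scaled -= scaled.max(axis=-1, keepdims=True)   # numerical stability
A = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)

# Step 3: weighted sum of the values.
output = A @ V
print(output.shape)   # (6, 4): one attended vector per token
```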
The output of the self-attention layer is, for each word position, a weighted sum of the value vectors derived from the input embeddings, where the weights are determined by the attention scores.
This weighted sum can be fed into a feedforward neural network for further processing and classification.
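As a rough sketch of that final stage, one simple option (an assumption here, not a prescribed design) is to mean-pool the attended vectors into a single review representation and pass it through a small feedforward classifier:

```python
import numpy as np

def classify_review(attn_out, W1, b1, W2, b2):
    # attn_out: (seq_len, d_v) output of the self-attention layer.
    # Mean-pool over the sequence to get one vector per review
    # (one common pooling choice among several).
    pooled = attn_out.mean(axis=0)               # (d_v,)
    hidden = np.maximum(0.0, pooled @ W1 + b1)   # ReLU hidden layer
    logits = hidden @ W2 + b2                    # (2,): negative / positive
    probs = np.exp(logits - logits.max())        # softmax over the two classes
    return probs / probs.sum()
```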
The self-attention layer allows the model to selectively focus on different parts of the review, depending on the context and the importance of each word in the overall sentiment of the review.
This can improve the accuracy of the model compared to traditional neural network architectures that treat each input element equally.