ARTIFICIAL INTELLIGENCE (36) – Natural Language Processing (14): Understanding Attention in Neural Machine Translation

Attention is one of the most important ideas in modern neural networks for language processing, especially in Neural Machine Translation (NMT) and Transformer models. Its main goal is to help the model decide where to focus when generating each word of a translation. Instead of treating all words in a sentence as equally important, attention allows the model to dynamically prioritize the most relevant ones.

At the core of the attention mechanism are three elements: queries, keys, and values. Queries and keys are vectors with a fixed dimensionality, commonly denoted d_k. When the model wants to measure how relevant a key is for a given query, it uses the dot product: it multiplies the two vectors component by component and sums the results. Although every dimension participates in this calculation, the result is a single scalar value. This is why d_k does not appear in the final attention scores; it is consumed during the summation.
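The dot product described above can be sketched in a few lines. The vectors below are illustrative values, not weights from any real model:

```python
import numpy as np

# A toy query and key, both of dimensionality d_k = 4 (illustrative values).
query = np.array([1.0, 0.0, 2.0, 1.0])
key = np.array([0.5, 1.0, 1.0, 0.0])

# Dot product: multiply component by component, then sum.
# The d_k components are consumed into one scalar score.
score = np.dot(query, key)  # 1*0.5 + 0*1.0 + 2*1.0 + 1*0.0 = 2.5
```

Note that however large d_k is, the score is always a single number, which is exactly why the dimension never appears in the attention scores themselves.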

When we compute attention scores for multiple queries and keys, the result is an attention matrix. With n queries and m keys, the attention matrix has size n × m. Each element of this matrix represents how strongly one query is related to one key. These scores indicate which source words are relevant for each step of the decoding process.
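A minimal sketch of the attention matrix, using random vectors and illustrative sizes (3 queries, 5 keys):

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_keys, d_k = 3, 5, 4  # illustrative sizes

Q = rng.normal(size=(n_queries, d_k))  # one row per query
K = rng.normal(size=(n_keys, d_k))     # one row per key

# One dot product per (query, key) pair: an n_queries x n_keys score matrix.
scores = Q @ K.T
print(scores.shape)  # (3, 5)
```

Row i of this matrix holds the raw relevance scores of every key with respect to query i.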

To make these scores usable, they are passed through a softmax function, producing what are called attention weights. These weights form a probability distribution and tell us how much importance the model assigns to each source word. In an NMT model, the attention weights do not represent target words or model predictions. Instead, they correspond to source sentence words, showing how much each source word contributes to generating the current target word.
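The softmax step can be sketched as follows, with made-up raw scores for three source words:

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])  # illustrative raw scores for 3 source words

def softmax(x):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

weights = softmax(scores)  # non-negative, and they sum to 1
```

Because the weights sum to one, they can be read directly as "how much of the model's focus each source word receives" at this decoding step.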

At a single decoding time step, the model then computes a context vector. This context vector is a weighted sum of the value vectors, where the weights come from the attention distribution. To compute the final attention output, we multiply the attention weights by the values, not by the keys or the raw logits. This step aggregates information from the source sentence in a way that reflects the model’s current focus.
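The context-vector computation is just a weighted sum; here is a sketch with illustrative weights and value vectors:

```python
import numpy as np

# Attention weights over three source words (they sum to 1).
weights = np.array([0.7, 0.2, 0.1])

# One value vector per source word, dimensionality 4 (illustrative values).
V = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

# Weighted sum of the value vectors: multiply weights by values, not keys.
context = weights @ V  # -> [0.7, 0.2, 0.1, 1.4]
```

The first source word carries weight 0.7, so its value vector dominates the resulting context vector.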

Another important concept in NMT attention is concatenation. Concatenation simply means joining vectors together end‑to‑end. In an attention‑based NMT decoder, the hidden state of the decoder at a given time step is concatenated with the context vector. This combined vector contains both the decoder’s internal state and the relevant information from the source sentence, making it more informative for predicting the next word.
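Concatenation itself is trivial; the sketch below joins a hypothetical decoder hidden state with a context vector (both with made-up values and sizes):

```python
import numpy as np

hidden = np.array([0.3, -0.1, 0.8])       # decoder hidden state (dim 3)
context = np.array([0.7, 0.2, 0.1, 1.4])  # context vector (dim 4)

# Joining the vectors end-to-end yields a dim-7 vector that carries both the
# decoder's internal state and the source information selected by attention.
combined = np.concatenate([hidden, context])
```

In a real decoder this combined vector would then feed the layer that predicts the next target word.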

In Transformer models, attention uses a scaled dot product. After computing the dot products between queries and keys, the result is divided by the square root of the key dimension, √d_k. The reason for this scaling is to reduce the sharpness of the attention distribution. Without scaling, large dot-product values can push the softmax output into an extremely peaked regime, leading to vanishing gradients and unstable training. Scaling keeps the scores in a reasonable range and helps the model learn more effectively.
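Putting the pieces together, here is a minimal sketch of scaled dot-product attention on random toy matrices (sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Divide the raw scores by sqrt(d_k) to keep the softmax from saturating.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Weighted sum of the values: one context vector per query.
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 4))  # 2 queries
K = rng.normal(size=(3, 4))  # 3 keys
V = rng.normal(size=(3, 4))  # 3 value vectors

output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the three keys, and each row of `output` is the corresponding context vector.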

In conclusion, attention works by computing similarities between queries and keys, transforming them into weights, and using those weights to combine values. These operations allow neural networks to focus on the most relevant parts of the source sentence at each step, making attention a powerful and essential component of modern language models.

Image © Saurabh Badole

"In recent years, the field of Natural Language Processing (NLP) has experienced significant advancements, and at the heart of this revolution is the Transformer model. Introduced in the groundbreaking paper 'Attention Is All You Need,' Transformers have redefined how machines understand and generate human language." (quote: Saurabh Badole)

References:

Why Attention Is All You Need?


Creative Commons License © Yolanda Muriel: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
