Introduction
Overview
Sequence models
Why the Transformer was proposed
Drawbacks of RNNs
Can CNNs replace RNNs?
Advantages of the Transformer
Model Architecture
Overall Architecture
Encoder and Decoder Stacks
Attention
Scaled Dot-Product Attention
Multi-Head Attention
Application of Attention in our Model
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding
Encoder and Decoder Stacks
Attention
- Scaled Dot-Product Attention
- Multi-Head Attention
Application of Attention in our Model
In the encoder-decoder attention layer, queries come from the previous decoder layer, while keys and values come from the output of the encoder, allowing each decoder position to capture information from various positions in the input sequence, similar to the encoder-decoder attention mechanism in sequence-to-sequence models.
In the self-attention layer of the encoder, each position can attend to all positions in the previous layer of the encoder, because all keys, values, and queries come from the same place, i.e., the output of the previous encoder layer.
In the self-attention layer of the decoder, each position can capture information from both current and previous positions. By adding masks, we prevent information from flowing leftward, maintaining the autoregressive property and ensuring that the decoder only relies on information available up to the current time step for prediction.
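To make these three uses concrete, here is a minimal NumPy sketch of scaled dot-product attention with an optional mask. It is a sketch under simplifying assumptions: the function name and shapes are my own, and queries, keys, and values are used directly without the learned projection matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional (n_q, n_k) array of 0/1, where 0 marks positions
    that must not be attended to.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (n_q, d_v)

# Decoder self-attention uses a lower-triangular (causal) mask, so position i
# only attends to positions <= i and the autoregressive property is preserved.
n, d_model = 5, 8
x = np.random.randn(n, d_model)
causal_mask = np.tril(np.ones((n, n)))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape)  # (5, 8)
```

For encoder self-attention the mask is simply omitted; for encoder-decoder attention, Q would come from the decoder states while K and V come from the encoder output.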
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding
Related Work
- Why Self-Attention
- n: sequence length
- d: representation (vector) dimension
- k: convolution kernel size
- r: neighborhood size in restricted self-attention
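With this notation, the per-layer comparison the paper makes (Table 1), as I recall it, is:
- Self-Attention: complexity per layer O(n²·d), O(1) sequential operations, O(1) maximum path length
- Recurrent: O(n·d²) per layer, O(n) sequential operations, O(n) maximum path length
- Convolutional: O(k·n·d²) per layer, O(1) sequential operations, O(log_k(n)) maximum path length
- Self-Attention (restricted to a neighborhood of size r): O(r·n·d) per layer, O(1) sequential operations, O(n/r) maximum path length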
- Why is self-attention important?
Self-attention is a method in neural networks for learning complex dependencies between input elements. It allows models not only to focus on local information within a sequence but also to capture long-range dependencies. This ability is crucial for tasks like natural language processing, speech recognition, and machine translation.
Here are some advantages of self-attention:
Parallel Computation: Compared to recurrent neural networks (RNNs), self-attention can process all elements in a sequence in parallel, greatly improving computational efficiency (see the sketch after this list).
Long-Range Dependencies: Self-attention can effectively capture long-range dependencies in a sequence, which is essential for understanding the grammatical and semantic structures in natural language.
Interpretability: Self-attention mechanisms provide a degree of interpretability, allowing us to examine the parts of the input that the model focuses on when making predictions.
Sparse Interaction: In practice, attention weights often concentrate on a small subset of the input elements. Restricted (local) self-attention exploits this by attending only to a neighborhood of size r around each position, reducing the per-layer cost from O(n²·d) to O(r·n·d).
Adaptation to Different Scales: Self-attention can adapt to different input scales, such as sentences, paragraphs, or documents, without changing the model’s structure.
In summary, self-attention is a powerful tool that aids models in better understanding and processing sequential data.
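To illustrate the parallel-computation point above: an RNN must walk the sequence step by step, whereas self-attention touches every position in the same few matrix products. This is a toy sketch with made-up tensors and a simplified recurrence, not the paper's exact equations.

```python
import numpy as np

n, d = 6, 4                       # sequence length, model dimension
x = np.random.randn(n, d)         # one toy input sequence
W = 0.1 * np.random.randn(d, d)   # shared toy weight matrix

# RNN-style processing: step t depends on step t-1, so the n steps must run
# sequentially and cannot be parallelized across the sequence.
h = np.zeros(d)
rnn_states = []
for t in range(n):
    h = np.tanh(x[t] @ W + h)     # simplified recurrence (no separate U or bias)
    rnn_states.append(h)

# Self-attention-style processing: softmax(x x^T / sqrt(d)) x computes all
# positions in the same matrix products, and any two positions interact
# directly, giving a constant-length dependency path.
scores = x @ x.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn_out = weights @ x

print(len(rnn_states), attn_out.shape)  # 6 (6, 4)
```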
Training
Training Data and Batching
Trained on the WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs, byte-pair encoded into a shared vocabulary of about 37,000 tokens.
For English-French, the significantly larger WMT 2014 English-French dataset was used: 36M sentences, with tokens split into a 32,000 word-piece vocabulary [38].
Hardware and Schedule
One machine with 8 NVIDIA P100 GPUs.
Base models: each training step took about 0.4 seconds; 100,000 steps, about 12 hours.
Big models: each training step took about 1.0 second; 300,000 steps, about 3.5 days.
- Optimizer (Adam with a warmup-then-decay learning-rate schedule; see the sketch after this list)
- Regularization (residual dropout and label smoothing)
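A minimal sketch of the paper's learning-rate schedule: linear warmup followed by inverse-square-root decay. The function name is mine; the defaults d_model=512 and warmup_steps=4000 are the base-model values as I recall them from the paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    Linear warmup for the first warmup_steps steps, then decay proportional
    to the inverse square root of the step number. The paper pairs this
    schedule with Adam (beta1=0.9, beta2=0.98, eps=1e-9).
    """
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```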
Experiments
Summary
References
- https://www.bilibili.com/video/BV1pu411o7BE/?spm_id_from=333.788&vd_source=5dc536abda18831d58d1dd35d61eee92
- https://zhuanlan.zhihu.com/p/500569055
- https://zhuanlan.zhihu.com/p/61494510
- https://blog.csdn.net/LiRongLu_/article/details/126384067
- https://blog.csdn.net/weixin_40607428/article/details/105407537
- https://zhuanlan.zhihu.com/p/497382888
- https://blog.csdn.net/weixin_60737527/article/details/127141542
My Take
An absolute classic. Since it was published, I first saw it show up in all kinds of recommendation and ad-ranking models, and then large language models began to use it everywhere. A foundational work.
"Attention Is All You Need" by Vaswani et al., published in 2017, laid the foundation for many later applications of attention mechanisms, from recommendation systems to large-scale models. It introduced the Transformer architecture, a significant milestone in natural language processing and machine learning.
The Transformer replaced traditional recurrent neural networks (RNNs) in many NLP applications and made self-attention a core architectural component. Its introduction led to remarkable advances in several areas, including:
Machine Translation: The Transformer greatly improved translation quality, achieving state-of-the-art results on benchmarks such as WMT 2014 English-German and English-French.
Large Pre-trained Models: The Transformer architecture paved the way for large pre-trained models like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and their variants, which have become the basis for numerous NLP applications.
Recommendation Systems: The attention mechanism introduced in Transformers found applications in recommendation systems, enabling personalized and efficient content recommendation.
Image and Video Processing: Transformers have been adapted for computer vision tasks, such as image classification and object detection, leading to models like Vision Transformers (ViTs).
Interpretable Attention: Researchers have explored ways to make attention mechanisms more interpretable and transparent, addressing the need for model explainability.
The paper "Attention Is All You Need" is a classic, pivotal work that triggered a revolution in deep learning and set the stage for many subsequent developments in artificial intelligence.