Introduction
Overview
Sequence models
Why the Transformer was proposed
Drawbacks of RNNs
Can CNNs replace RNNs?
Advantages of the Transformer
Model Architecture
Overall Architecture
Encoder and Decoder Stacks
Attention
Scaled Dot-Product Attention
Multi-Head Attention
Application of Attention in our Model
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding
Encoder and Decoder Stacks
Attention
- Scaled Dot-Product Attention
- Multi-Head Attention
Application of Attention in our Model
In the encoder-decoder attention layer, queries come from the previous decoder layer, while keys and values come from the output of the encoder, allowing each decoder position to capture information from various positions in the input sequence, similar to the encoder-decoder attention mechanism in sequence-to-sequence models.
In the self-attention layer of the encoder, each position can attend to all positions in the previous layer of the encoder, because all keys, values, and queries come from the same place, i.e., the output of the previous encoder layer.
In the self-attention layer of the decoder, each position can capture information from both current and previous positions. By adding masks, we prevent information from flowing leftward, maintaining the autoregressive property and ensuring that the decoder only relies on information available up to the current time step for prediction.
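To make these three uses concrete, here is a minimal NumPy sketch of scaled dot-product attention with an optional mask. It is a sketch under simplifying assumptions: the function name and shapes are my own, and queries, keys, and values are used directly without the learned projection matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional (n_q, n_k) array of 0/1, where 0 marks positions
    that must not be attended to.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (n_q, d_v)

# Decoder self-attention uses a lower-triangular (causal) mask, so position i
# only attends to positions <= i and the autoregressive property is preserved.
n, d_model = 5, 8
x = np.random.randn(n, d_model)
causal_mask = np.tril(np.ones((n, n)))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape)  # (5, 8)
```

For encoder self-attention the mask is simply omitted; for encoder-decoder attention, Q would come from the decoder states while K and V come from the encoder output.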
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding
Related Work
- Why Self-Attention
- n: sequence length
- d: representation (vector) dimension
- k: convolution kernel size
- r: neighborhood size in restricted self-attention
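With this notation, the per-layer comparison the paper makes (Table 1), as I recall it, is:
- Self-Attention: complexity per layer O(n²·d), O(1) sequential operations, O(1) maximum path length
- Recurrent: O(n·d²) per layer, O(n) sequential operations, O(n) maximum path length
- Convolutional: O(k·n·d²) per layer, O(1) sequential operations, O(log_k(n)) maximum path length
- Self-Attention (restricted to a neighborhood of size r): O(r·n·d) per layer, O(1) sequential operations, O(n/r) maximum path length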
- Why is self-attention important?
Self-attention is a method in neural networks for learning complex dependencies between input elements. It allows models not only to focus on local information within a sequence but also to capture long-range dependencies. This ability is crucial for tasks like natural language processing, speech recognition, and machine translation.
Here are some advantages of self-attention:
Parallel Computation: Compared to recurrent neural networks (RNNs), self-attention can process all elements in a sequence in parallel, greatly improving computational efficiency (see the sketch after this list).
Long-Range Dependencies: Self-attention can effectively capture long-range dependencies in a sequence, which is essential for understanding the grammatical and semantic structures in natural language.
Interpretability: Self-attention mechanisms provide a degree of interpretability, allowing us to examine the parts of the input that the model focuses on when making predictions.
Sparse Interaction: In practice, attention weights often concentrate on a small subset of the input elements. Restricted (local) self-attention exploits this by attending only to a neighborhood of size r around each position, reducing the per-layer cost from O(n²·d) to O(r·n·d).
Adaptation to Different Scales: Self-attention can adapt to different input scales, such as sentences, paragraphs, or documents, without changing the model’s structure.
In summary, self-attention is a powerful tool that aids models in better understanding and processing sequential data.
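To illustrate the parallel-computation point above: an RNN must walk the sequence step by step, whereas self-attention touches every position in the same few matrix products. This is a toy sketch with made-up tensors and a simplified recurrence, not the paper's exact equations.

```python
import numpy as np

n, d = 6, 4                       # sequence length, model dimension
x = np.random.randn(n, d)         # one toy input sequence
W = 0.1 * np.random.randn(d, d)   # shared toy weight matrix

# RNN-style processing: step t depends on step t-1, so the n steps must run
# sequentially and cannot be parallelized across the sequence.
h = np.zeros(d)
rnn_states = []
for t in range(n):
    h = np.tanh(x[t] @ W + h)     # simplified recurrence (no separate U or bias)
    rnn_states.append(h)

# Self-attention-style processing: softmax(x x^T / sqrt(d)) x computes all
# positions in the same matrix products, and any two positions interact
# directly, giving a constant-length dependency path.
scores = x @ x.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn_out = weights @ x

print(len(rnn_states), attn_out.shape)  # 6 (6, 4)
```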
Training
Training Data and Batching
Trained on the WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs, byte-pair encoded into a shared vocabulary of about 37,000 tokens.
For English-French, the significantly larger WMT 2014 English-French dataset was used: 36M sentences, with tokens split into a 32,000 word-piece vocabulary [38].
Hardware and Schedule
One machine with 8 NVIDIA P100 GPUs.
Base models: each training step took about 0.4 seconds; 100,000 steps, about 12 hours.
Big models: each training step took about 1.0 second; 300,000 steps, about 3.5 days.
- Optimizer (Adam with a warmup-then-decay learning-rate schedule; see the sketch after this list)
- Regularization (residual dropout and label smoothing)
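A minimal sketch of the paper's learning-rate schedule: linear warmup followed by inverse-square-root decay. The function name is mine; the defaults d_model=512 and warmup_steps=4000 are the base-model values as I recall them from the paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    Linear warmup for the first warmup_steps steps, then decay proportional
    to the inverse square root of the step number. The paper pairs this
    schedule with Adam (beta1=0.9, beta2=0.98, eps=1e-9).
    """
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```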
Experiments
Summary
References
- https://www.bilibili.com/video/BV1pu411o7BE/?spm_id_from=333.788&vd_source=5dc536abda18831d58d1dd35d61eee92
- https://zhuanlan.zhihu.com/p/500569055
- https://zhuanlan.zhihu.com/p/61494510
- https://blog.csdn.net/LiRongLu_/article/details/126384067
- https://blog.csdn.net/weixin_40607428/article/details/105407537
- https://zhuanlan.zhihu.com/p/497382888
- https://blog.csdn.net/weixin_60737527/article/details/127141542
My Take
An absolute classic. Since it was published, I first saw it show up in all kinds of recommendation and ad-ranking models, and then large language models began to use it everywhere. A foundational work.
"Attention Is All You Need" by Vaswani et al., published in 2017, laid the foundation for many later applications of attention mechanisms, from recommendation systems to large-scale models. It introduced the Transformer architecture, a significant milestone in natural language processing and machine learning.
The Transformer replaced traditional recurrent neural networks (RNNs) in many NLP applications and made self-attention a core architectural component. Its introduction led to remarkable advances in several areas, including:
Machine Translation: The Transformer greatly improved translation quality, achieving state-of-the-art results on benchmarks such as WMT 2014 English-German and English-French.
Large Pre-trained Models: The Transformer architecture paved the way for large pre-trained models like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and their variants, which have become the basis for numerous NLP applications.
Recommendation Systems: The attention mechanism introduced in Transformers found applications in recommendation systems, enabling personalized and efficient content recommendation.
Image and Video Processing: Transformers have been adapted for computer vision tasks, such as image classification and object detection, leading to models like Vision Transformers (ViTs).
Interpretable Attention: Researchers have explored ways to make attention mechanisms more interpretable and transparent, addressing the need for model explainability.
The paper "Attention Is All You Need" is a classic, pivotal work that triggered a revolution in deep learning and set the stage for many subsequent developments in artificial intelligence.