Looking back to 2018, deep learning work in NLP centered on word2vec and on designing custom deep models for individual tasks. Although pretrained models such as ELMo had begun to emerge, their influence was still limited. It was in this context that the first-generation GPT pretrained language model was released.
The GPT paper is titled “Improving Language Understanding by Generative Pre-Training”, that is, improving language understanding through generative pretraining. The title raises two questions:
What is meant by generality? What kind of objective function is effective when learning general, transferable representations of text?
Once a general feature representation is obtained, how can it be applied to different downstream tasks?
GPT addresses both questions with a two-stage recipe: pretraining followed by fine-tuning.
Because labeled data is scarce, discriminatively trained models struggle to perform well on natural language understanding tasks such as textual entailment, question answering, semantic similarity assessment, and document classification.
By “generative pretraining” of a language model (i.e., GPT) on a diverse corpus of unlabeled text, followed by “discriminative fine-tuning” on each specific task, large gains can be achieved on these tasks.
Direction: Reduce the reliance of Natural Language Processing (NLP) on supervised learning, and demonstrate experimentally that this is achievable.
Challenge: It is currently unclear which type of optimization objective is most effective when learning transferable text representations, and there is no clear consensus on how to effectively apply these learned representations to target tasks.
Solution: We propose a semi-supervised approach that combines unsupervised pre-training and supervised fine-tuning. We adopt the Transformer architecture, which provides more structured memory and can handle long-range dependencies in text, resulting in strong transfer performance.
Pretrain a neural network (LSTM) with a language modeling objective, and then fine-tune it on a supervised target task.
Auxiliary training objectives
Adding auxiliary unsupervised training objectives is another form of semi-supervised learning. Unsupervised pretraining on its own already learns several linguistic aspects relevant to the target tasks.
During the pretraining stage, GPT trains a language model on a large unlabeled corpus of tokens U = {u1, u2, …, un} by maximizing the likelihood of the text. The language model predicts the probability of the i-th token given the k tokens that precede it (from the (i-k)-th to the (i-1)-th). Here k is the size of the context window, which determines how much preceding text the model sees. The value of k can be chosen as needed: a larger k lets the model condition on more context, while a smaller k restricts it.
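Written out, the objective described above is the standard language modeling log-likelihood used in the GPT paper. The symbol Θ, the parameters of the neural network that models the conditional probability P, is not introduced in the text above and is added here for completeness:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```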
The authors use the Transformer decoder as the core of the GPT model to process the input context U = (u-k, …, u-1). First, the input tokens are mapped to word embeddings through an embedding matrix, and position embeddings are added so that the model knows where each token sits in the sequence. The resulting vectors are then passed through a stack of Transformer blocks, which update each token representation using its context. Finally, the output of the last Transformer block is passed to a fully connected layer to produce the prediction.
In summary, the language model is built by converting the input text into word embeddings, adding position information, applying Transformer blocks for contextual processing, and passing the output through a fully connected layer to produce predictions. The resulting model can be applied to a variety of natural language processing tasks, and the choice of k controls how much context it uses.
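The forward pass summarized above can be written compactly, following the formulation in the GPT paper. One detail worth noting is that the final fully connected layer reuses the transposed word embedding matrix rather than a separate weight matrix:

```latex
\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}), \quad l = 1, \ldots, n \\
P(u) &= \mathrm{softmax}\left(h_n W_e^{\top}\right)
\end{aligned}
```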
Here, W_e is the word embedding matrix, W_p is the position embedding matrix, h_l is the output of the l-th Transformer layer, h_n is the output of the last Transformer layer, and n is the number of layers in the model.
Unlike the original Transformer, which encodes positions with fixed sinusoidal functions, GPT uses a learnable position embedding matrix. In practice, the two approaches appear to perform similarly.
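To make the contrast concrete, here is a minimal pure-Python sketch of the two options. The function names are hypothetical, and the initialization standard deviation of 0.02 for the learned table is an assumption made for illustration. The learned table is only initialized here; in a real model it would be updated by gradient descent along with the other weights.

```python
import math
import random

def sinusoidal_position(pos, d_model):
    """Fixed encoding: sine on even dimensions, cosine on odd dimensions."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def init_learned_positions(max_len, d_model, std=0.02, seed=0):
    """Learned alternative: a (max_len x d_model) table of trainable values.

    Only the random initialization is shown; std=0.02 is an assumed value.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)] for _ in range(max_len)]

fixed = [sinusoidal_position(p, 8) for p in range(4)]
learned = init_learned_positions(max_len=4, d_model=8)
```

The fixed encoding needs no parameters and is defined for any position, while the learned table is limited to max_len positions but can adapt to the data.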
In NLP, common downstream tasks include text classification, textual entailment, text similarity, and question answering. For these tasks, input sequences are built following a common pattern: special markers are added at the beginning and end of the sequence, and delimiters separate its parts. Note that in practice the literal words “Start”, “Extract”, and “Delim” are not used; dedicated special tokens represent these markers.
Regardless of the downstream task, input sequences are constructed in the same way. A start marker is added at the beginning of the sequence and an end marker at the end. If there are multiple input segments, they are separated by a delimiter so that the model can tell where each one begins and ends.
Once the input sequences are built, they are encoded by the pre-trained GPT model. Typically, the feature vector of the last token in the sequence is used for the downstream prediction. Different downstream tasks may use different prediction layers to suit their requirements.
In summary, although the input construction and the prediction layer vary across downstream tasks, the feature extraction module stays the same, which gives the model good transferability across NLP tasks.
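The construction rule above can be sketched in a few lines of Python. The marker strings and the build_input helper are hypothetical names chosen for illustration, and building one sequence per candidate answer for a multiple-choice question is an assumption about how such a task might be laid out, not something the text above specifies.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # hypothetical marker strings

def build_input(*segments):
    """Wrap token lists as: START seg1 DELIM seg2 ... EXTRACT."""
    if not segments:
        raise ValueError("at least one segment is required")
    tokens = [START]
    for idx, segment in enumerate(segments):
        if idx > 0:
            tokens.append(DELIM)  # boundary between consecutive segments
        tokens.extend(segment)
    tokens.append(EXTRACT)  # the feature at this position feeds the prediction layer
    return tokens

# Single-segment task (classification).
classification = build_input(["great", "movie"])

# Two-segment task (premise and hypothesis for entailment).
entailment = build_input(["it", "rains"], ["the", "ground", "is", "wet"])

# Assumed layout for multiple choice: one sequence per candidate answer.
multiple_choice = [build_input(["2", "+", "2", "?"], [answer]) for answer in ("3", "4")]
```

Because every task is reduced to a single token sequence, the same pretrained encoder can be reused unchanged, and only the layer that reads the final position differs.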
For a classification task, given a piece of text, the goal is to predict its label.
In the fine-tuning stage, given an input sequence x^1, …, x^m, the goal is to predict the probability of the label y. The sequence is first fed through the pre-trained model to obtain h_l^m, the feature of the last token x^m at the last Transformer layer. A linear layer followed by a softmax function then turns this feature into the final prediction.
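In the notation of the GPT paper, this prediction step and the resulting supervised objective can be written as follows. The symbols W_y (the weights of the added linear output layer) and C (the labeled dataset) do not appear in the text above and are introduced here; the superscripts index the input tokens and h_l^m is the last-layer feature of the final token:

```latex
P\left(y \mid x^1, \ldots, x^m\right) = \mathrm{softmax}\left(h_l^m W_y\right), \qquad
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
```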
The authors combine two terms in the fine-tuning objective: the language modeling loss from pre-training and the supervised loss from the labeled task. Because this combination was found to give better results during fine-tuning, the two terms are added together, with the language modeling term weighted by λ, and the sum is used to train the model.
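As a rough illustration, the combined objective has the form L3(C) = L2(C) + λ · L1(C) in the GPT paper, where both terms are log-likelihoods to be maximized. The sketch below computes it for a single example in pure Python. All function names are hypothetical, and the logits are toy values standing in for real model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def lm_log_likelihood(token_logits, targets):
    """Sum of log P(next token) over positions: the language modeling term."""
    return sum(log_softmax(row)[t] for row, t in zip(token_logits, targets))

def task_log_likelihood(class_logits, label):
    """log P(label | sequence) from the classifier head: the supervised term."""
    return log_softmax(class_logits)[label]

def combined_objective(class_logits, label, token_logits, targets, lam=0.5):
    """Supervised term plus lam times the language modeling term (to be maximized)."""
    return (task_log_likelihood(class_logits, label)
            + lam * lm_log_likelihood(token_logits, targets))

# Toy example: uniform logits over two classes and a two-token vocabulary.
score = combined_objective(
    class_logits=[0.0, 0.0], label=1,
    token_logits=[[0.0, 0.0], [0.0, 0.0]], targets=[0, 1],
)
```

A training loop would minimize the negative of this value; setting lam to 0 recovers plain supervised fine-tuning.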
For unsupervised training, we used the BooksCorpus dataset, which contains over 7,000 unpublished books spanning genres such as adventure, fantasy, and romance. Crucially, it contains long stretches of contiguous text, which lets the generative model learn to condition on long-range context, something that matters greatly for language model performance. An alternative dataset, the 1B Word Benchmark used by ELMo, is of similar size but is shuffled during preprocessing, which destroys long-range text structure. Our language model achieves a token-level perplexity of 18.4 on BooksCorpus.
In terms of architecture, our model largely follows the original Transformer. We trained a 12-layer decoder-only Transformer with masked self-attention, using 768-dimensional states and 12 attention heads. The position-wise feedforward networks used 3072-dimensional inner states. We optimized the model with Adam at a maximum learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and then annealed with a cosine schedule. We trained for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. The model weights were initialized from N(0, 0.02). The vocabulary was built with byte pair encoding (BPE) using 40,000 merges. For regularization, we applied dropout with a rate of 0.1 to the residual, embedding, and attention layers, together with a modified form of L2 regularization with a weight of 0.01. The activation function was the Gaussian Error Linear Unit (GELU). We used learned position embeddings instead of the sinusoidal encoding of the original Transformer. For data preprocessing, we cleaned the raw BooksCorpus text with the ftfy library, standardized punctuation and whitespace, and tokenized with the spaCy tokenizer.
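The learning rate schedule described above (linear warmup over 2000 updates to 2.5e-4, then cosine annealing) might look like the following sketch. The function name and the total number of updates are hypothetical, and annealing all the way down to zero is stated here as an assumption about the end point of the cosine phase.

```python
import math

def pretrain_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from 0 to max_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

TOTAL = 100_000  # hypothetical number of updates, chosen only for illustration
schedule = [pretrain_lr(s, TOTAL) for s in (0, 1000, 2000, 51_000, TOTAL)]
```

The rate peaks exactly at the end of warmup and falls to half its maximum midway through the cosine phase.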
Unless specified otherwise, we kept the hyperparameter settings consistent. We added dropout with a rate of 0.1 to the classifier. For most tasks, we used a learning rate of 6.25e-5 and a batch size of 32. Fine-tuning is fast: 3 epochs of training were usually sufficient. We used a linear learning rate decay schedule with warmup over the first 0.2% of training, and set λ to 0.5.
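For comparison with pre-training, the fine-tuning schedule (warmup over 0.2% of training, then linear decay) can be sketched as below. The function name and the total step count are hypothetical, and decaying to exactly zero at the final step is an assumption.

```python
def finetune_lr(step, total_steps, max_lr=6.25e-5, warmup_frac=0.002):
    """Linear warmup over warmup_frac of training, then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return max_lr * max(0.0, (total_steps - step) / remaining)

TOTAL = 10_000  # hypothetical number of fine-tuning updates
points = [finetune_lr(s, TOTAL) for s in (0, 10, 20, 5010, TOTAL)]
```

With 10,000 updates the warmup lasts only 20 steps, so almost the entire run is spent in the decay phase.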
Experiments show that transferring more pre-trained Transformer decoder layers to the supervised task improves performance. On the MultiNLI and RACE datasets, we observed that transferring the embeddings alone already improves performance, and that each additional Transformer layer provides further gains, up to 9% on MultiNLI for full transfer. This indicates that each pre-trained Transformer layer contains features useful for solving the target tasks.
In addition, we studied the zero-shot behavior of the pre-trained model, that is, its performance on various tasks without any fine-tuning. Zero-shot performance improves steadily as pre-training proceeds, which suggests that the underlying generative model learns to perform many of these tasks in order to improve its language modeling. Models with more complex structures benefit more from this learning process and thereby acquire zero-shot abilities. The attentional memory of the Transformer also helps transfer, and it outperforms traditional models such as the LSTM. Moreover, the LSTM shows higher variance in its zero-shot performance, suggesting that the inductive bias of the Transformer architecture assists transfer and makes it better suited to the demands of different tasks.
We conducted three ablation studies to investigate how our method behaves under different conditions. First, we examined performance without the auxiliary language modeling objective during fine-tuning. The auxiliary objective helps on the NLI tasks and QQP, and the overall trend suggests that it benefits larger datasets but has limited effect on smaller ones. Second, to assess the contribution of the Transformer, we compared it with a single-layer 2048-unit LSTM within the same framework. Using the LSTM instead of the Transformer lowers the average score by 5.6 points, and the LSTM outperforms the Transformer on only one dataset, MRPC. Finally, we compared our full model with the same Transformer architecture trained directly on the supervised target tasks, without pre-training. The lack of pre-training hurts performance on all tasks, resulting in a 14.8% decrease compared to our full model.
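This work proposes a framework that combines generative pre-training with discriminative fine-tuning, giving a single task-agnostic model strong natural language understanding ability. The method relies on unsupervised (pre-)training to improve performance on discriminative tasks, and it also encourages new research into unsupervised learning.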
