Key Points
Looking back to 2018, deep learning work in NLP was still centered on word2vec embeddings and on designing task-specific deep models. Pretrained models such as ELMo had begun to emerge, but their influence was still limited. It was against this backdrop that the first generation of GPT pretrained language models was released.
GPT was introduced in the paper “Improving Language Understanding by Generative Pre-Training”, i.e., improving language understanding through generative pretraining. The title raises two questions:
What counts as a general representation? When learning general, transferable text representations, what kind of objective function is effective?
Once a general representation has been obtained, how should it be applied to different downstream tasks?
GPT answers both questions with a pretraining-plus-fine-tuning approach.
GPT Evolution Path
Abstract
Because labeled data is scarce, discriminatively trained models struggle on natural language understanding tasks such as textual entailment, question answering, semantic similarity assessment, and document classification.
Generatively pretraining a language model on a diverse corpus of unlabeled text (this is the GPT), followed by discriminative fine-tuning on each specific task, yields large gains on these tasks.
Introduction
Direction: effectively reduce NLP's dependence on supervised learning, and demonstrate this experimentally.
Challenge: it is unclear which type of optimization objective is most effective for learning transferable text representations, and there is no consensus on the most effective way to apply these learned representations to target tasks.
Solution: a semi-supervised approach combining unsupervised pretraining with supervised fine-tuning. The Transformer architecture is adopted because it provides a more structured memory and handles long-range dependencies in text, which leads to strong transfer performance.
Related Work
Semi-supervised Learning for NLP
Unsupervised Pre-training
Pretrain a neural network (e.g., an LSTM) with a language-modeling objective, then fine-tune it on a supervised target task.
Auxiliary training objectives
Adding an auxiliary unsupervised training objective is another form of semi-supervised learning; unsupervised pretraining has already learned many linguistic aspects relevant to the target task.
Framework
Unsupervised Pre-training
During pretraining, GPT's objective is to build a strong language model from a large corpus of input tokens U = {u1, u2, …, un} by maximizing the likelihood of the text. Concretely, the language model predicts the probability of the i-th token given the preceding tokens (from the (i−k)-th to the (i−1)-th). The parameter k is the size of the context window and determines how much preceding text the model conditions on; a larger k lets the model use more context, while a smaller k restricts it.
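Written out in the paper's notation, with Θ the model parameters, this maximum-likelihood objective is:

$$L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right)$$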
The authors use the Transformer decoder as the core of the GPT model to process the input context U = (u−k, …, u−1). The input tokens are first mapped to word embeddings through an embedding matrix, and position embeddings are added so the model knows where each token sits in the sequence. These embeddings are then passed through a stack of Transformer blocks, which perform context-aware processing of the input, and the output of the final block is projected through an output layer to produce the prediction.
In short, the model converts input text into word embeddings, adds positional information, processes the context with Transformer blocks, and produces predictions through an output layer. The resulting language model can serve a wide range of NLP tasks, and the choice of k controls how much context it takes into account.
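In the paper's notation, this forward pass is:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}), \quad l = 1, \dots, n$$
$$P(u) = \text{softmax}(h_n W_e^{\top})$$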
Here, We is the word embedding matrix, Wp is the position embedding matrix, h_l is the output of the l-th Transformer layer, h_n is the output of the final layer, and n is the number of layers.
Unlike the original Transformer, which uses sinusoidal functions for position encoding, this paper uses a learnable position embedding matrix to represent positional information. In practice the two approaches appear to perform similarly.
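For intuition, the forward pass can be sketched in PyTorch roughly as follows. This is an illustrative sketch rather than the original implementation: PyTorch's built-in encoder layer with a causal mask stands in for the paper's decoder blocks, and the dimensions match the base configuration described later (12 layers, 768-dim states, 12 heads, 3072-dim feed-forward).

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Illustrative GPT-1-style decoder-only language model (a sketch, not the original code)."""

    def __init__(self, vocab_size=40000, n_ctx=512, d_model=768, n_layer=12, n_head=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # word embedding matrix W_e
        self.pos_emb = nn.Embedding(n_ctx, d_model)       # learned position matrix W_p
        self.drop = nn.Dropout(0.1)
        block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_head, dim_feedforward=4 * d_model,
            dropout=0.1, activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layer)

    def forward(self, idx):
        # idx: (batch, seq_len) tensor of token ids
        t = idx.size(1)
        pos = torch.arange(t, device=idx.device)
        h = self.drop(self.tok_emb(idx) + self.pos_emb(pos))           # h_0 = U W_e + W_p
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        h = self.blocks(h, mask=causal)                                # h_l = block(h_{l-1})
        return h @ self.tok_emb.weight.T                               # logits; softmax gives P(u)
```

Calling `model(token_ids)` on a (batch, seq_len) tensor of ids returns (batch, seq_len, vocab) logits. The output projection is tied to the input embedding, mirroring the softmax(h_n W_e^T) step above, and the position embeddings are an ordinary `nn.Embedding` trained with the rest of the model, matching the learned-position choice just discussed.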
Common downstream tasks in NLP include text classification, textual entailment, text similarity, and question answering. For all of them, the input sequence is built following a common pattern: special tokens are added at the start and end of the sequence, and delimiters are inserted between sub-sequences where needed. Note that in practice the literal strings “Start/Extract/Delim” are not used; dedicated special symbols represent these markers.
Whatever the downstream task, the input is constructed along the same lines. Start and end markers are added at the beginning and end of the sequence, and when there are multiple input segments they are separated by a delimiter so the model can correctly identify each segment's boundaries.
Once the inputs are constructed, they are encoded with the pretrained GPT model, and the feature vector of the last token in the sequence is typically used for the downstream prediction. Different downstream tasks may attach different prediction heads to suit their specific requirements.
In short, although input construction and prediction heads differ across downstream tasks, the feature-extraction module stays the same throughout, which gives the model good transferability across a wide range of NLP tasks. The tasks covered here are classification, entailment, similarity, and multiple choice; in each case the input text is processed by the Transformer and a linear classifier produces the final output.
For classification, given a piece of text, the goal is to predict its label.
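As a concrete illustration, input construction for the four task formats might look like the sketch below. The token strings and helper names are placeholders (the real implementation uses dedicated learned symbols, not literal “Start/Extract/Delim” strings), and inputs are assumed to be lists of tokens.

```python
# Placeholder special tokens for illustration only.
START, EXTRACT, DELIM = "<s>", "<e>", "$"

def build_classification(text):
    # single segment: [Start] text [Extract]
    return [START] + text + [EXTRACT]

def build_entailment(premise, hypothesis):
    # premise and hypothesis joined by a delimiter
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def build_similarity(text_a, text_b):
    # no inherent ordering, so both orderings are encoded and later combined
    return (
        [START] + text_a + [DELIM] + text_b + [EXTRACT],
        [START] + text_b + [DELIM] + text_a + [EXTRACT],
    )

def build_multiple_choice(context, question, answers):
    # one sequence per candidate answer; each is scored independently
    return [
        [START] + context + question + [DELIM] + answer + [EXTRACT]
        for answer in answers
    ]
```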
Supervised Fine-tuning
In the fine-tuning stage, given an input sequence x^1, …, x^m, the goal is to predict the probability of the label y. The sequence is first fed through the pretrained model to obtain h_l^m, the final Transformer layer's feature for the last token x^m, and a softmax classifier over this feature produces the final prediction.
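In the paper's notation, with W_y the weight matrix of the newly added classification layer and C the labeled dataset:

$$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$$
$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$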
The fine-tuning objective combines two loss terms: one carried over from the pretraining (language-modeling) objective and one from the supervised task. Adding the two losses together was found to give better results during fine-tuning, so the combined loss is used for training; this strategy helps improve model performance.
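Putting this together, one fine-tuning step with the combined loss L3 = L2 + λ·L1 could be sketched as follows. The model's return signature, the helper names, and the classifier head are assumptions made for illustration; λ = 0.5 is the value reported later in the setup.

```python
import torch.nn.functional as F

def finetune_step(model, clf_head, optimizer, tokens, labels, lam=0.5):
    """One supervised fine-tuning step with the auxiliary LM objective (sketch only).

    Assumes model(tokens) returns (hidden, lm_logits), where hidden is the last
    Transformer layer's output and clf_head is a freshly added nn.Linear head.
    """
    hidden, lm_logits = model(tokens)                 # tokens: (B, T) token ids
    h_extract = hidden[:, -1, :]                      # h_l^m: feature of the final token
    task_logits = clf_head(h_extract)                 # ~ softmax(h_l^m W_y)
    task_loss = F.cross_entropy(task_logits, labels)  # L2: supervised objective
    lm_loss = F.cross_entropy(                        # L1: auxiliary language modeling
        lm_logits[:, :-1, :].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    loss = task_loss + lam * lm_loss                  # L3 = L2 + lambda * L1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```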
Experiments
Setup
For unsupervised training we used the BooksCorpus dataset, which contains over 7,000 unpublished books across a range of genres such as adventure, fantasy, and romance. Crucially, it consists of long stretches of contiguous text, which lets the generative model learn long-range context, something that matters a great deal for language-model quality. An alternative, the 1B Word Benchmark, is roughly the same size as the corpus used by ELMo, but it is shuffled during preprocessing, destroying long-range structure. Our language model performs very well on the BooksCorpus, reaching a token-level perplexity of just 18.4.
Model specifications: the model largely follows the original Transformer. We train a 12-layer decoder-only Transformer with 768-dimensional states and 12 self-attention heads per layer, and 3072-dimensional inner states in the position-wise feed-forward networks. Optimization uses Adam with a maximum learning rate of 2.5e-4; the learning rate increases linearly from zero over the first 2,000 updates and is then annealed with a cosine schedule. Training uses mini-batches of 64 randomly sampled contiguous sequences of 512 tokens, for 100 epochs. Weights are initialized from N(0, 0.02). The vocabulary uses byte-pair encoding (BPE) with 40,000 merges. For regularization, a dropout rate of 0.1 is applied to residual, embedding, and attention layers, and a modified version of L2 regularization with w = 0.01 is applied to all non-bias and non-gain weights. The activation function is the Gaussian Error Linear Unit (GELU), and learned position embeddings are used instead of the original sinusoidal ones. For preprocessing, the ftfy library cleans the raw BooksCorpus text, some punctuation and whitespace are standardized, and the spaCy tokenizer handles tokenization.
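These hyperparameters can be collected into a simple configuration sketch (field names are illustrative, not taken from the original code):

```python
from dataclasses import dataclass

@dataclass
class GPT1PretrainConfig:
    n_layer: int = 12           # decoder-only Transformer layers
    d_model: int = 768          # hidden state size
    n_head: int = 12            # self-attention heads per layer
    d_ff: int = 3072            # position-wise feed-forward inner size
    n_ctx: int = 512            # contiguous tokens per training sample
    vocab_size: int = 40000     # BPE merges
    batch_size: int = 64
    epochs: int = 100
    max_lr: float = 2.5e-4      # Adam; linear warmup over the first 2000 updates,
    warmup_steps: int = 2000    # then annealed with a cosine schedule
    dropout: float = 0.1        # residual, embedding, and attention dropout
    weight_decay: float = 0.01  # modified L2 on non-bias/gain weights
    init_std: float = 0.02      # weights initialized from N(0, 0.02)
    activation: str = "gelu"
```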
Unless noted otherwise, we reuse consistent hyperparameter settings for fine-tuning. A dropout rate of 0.1 is added to the classifier. For most tasks we use a learning rate of 6.25e-5 and a batch size of 32. Fine-tuning is fast: 3 epochs of training are usually enough. We use a linear learning-rate decay schedule with warmup over the first 0.2% of training, and set λ to 0.5.
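For reference, the fine-tuning defaults mentioned above, gathered in one place (again a sketch with illustrative names):

```python
# Fine-tuning defaults for most tasks, taken from the setup above (illustrative names)
FINETUNE_DEFAULTS = dict(
    classifier_dropout=0.1,
    learning_rate=6.25e-5,
    batch_size=32,
    epochs=3,
    lr_schedule="linear_decay",
    warmup_fraction=0.002,   # warmup over the first 0.2% of training
    lm_loss_weight=0.5,      # lambda in L3 = L2 + lambda * L1
)
```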
Supervised Fine-tuning
Analysis
Experiments show that transferring more Transformer decoder layers yields clear gains in performance and generalization; the benefit shows up both during pretraining and on supervised tasks, with each additional transferred layer improving results further. On the MultiNLI and RACE datasets we observe that transferring the embeddings improves performance and that every Transformer layer provides an additional boost, up to a 9% gain on MultiNLI when the full model is transferred. This indicates that each layer of the pretrained model contains functionality useful for solving the target tasks.
We also examined the zero-shot behavior of the pretrained model, i.e., its performance on various tasks without any fine-tuning. The Transformer performs well here, and its zero-shot ability grows steadily as pretraining progresses. This suggests that the underlying generative model learns to perform many of these tasks as a by-product of improving its language-modeling ability, and that the more structured Transformer benefits from this process more than alternatives, which is what enables the zero-shot behavior. The Transformer's attentional memory also aids transfer compared with traditional models such as the LSTM. Moreover, the LSTM's zero-shot performance shows higher variance, suggesting that the inductive bias of the Transformer architecture assists transfer and makes it better suited to the demands of diverse tasks.
We ran three ablation studies to examine the method under different conditions. First, we evaluated the method without the auxiliary language-modeling objective during fine-tuning: the auxiliary objective helps on the NLI tasks and on QQP, and the overall trend is that it benefits large datasets but has limited effect on small ones. Second, we compared the Transformer against a single-layer 2048-unit LSTM: swapping in the LSTM drops the average score by 5.6 points, and the LSTM only slightly outperforms the Transformer on MRPC. Finally, we compared against the Transformer architecture trained directly on the supervised target tasks without pretraining: removing pretraining significantly hurts all tasks, with a 14.8% drop relative to the full model.
Conclusion
A framework is proposed that combines generative pretraining with discriminative fine-tuning to give a single task-agnostic model strong natural language understanding. The approach relies on unsupervised (pre-)training to boost performance on discriminative tasks, and also motivates new research into unsupervised learning.
My Commentary
With the explosion of OpenAI's ChatGPT in early 2023, large models have well and truly broken into the mainstream. Almost every internet company is going all-in on AI, and the strategic goal set at our company's annual meeting is a full year of work on large AI models. Unsurprisingly, nearly every department, whether or not its work is related, is pivoting in this direction.
That is why I wrote this article. It is only the beginning; I plan to follow up with a dozen or more related posts.
Let's keep at it, you and I!