Research Background

Current models have the following issues:

  • Excessive dependence on in-domain labeled data: both the pre-training model and the fine-tuning stage need large amounts of domain-specific labeled data to perform well, which makes data annotation costly.
  • Over-fitting to the data distribution of a specific domain: during fine-tuning the model fits the training distribution too closely, especially when training data is limited, which leads to over-fitting and weakens generalization.
  • Limited cross-domain applicability: because the model over-fits the domain-specific data during fine-tuning, it transfers poorly to other domains and struggles to perform well across fields.

In short, current models suffer from high data-annotation costs, poor generalization, and limited domain applicability.

GPT-2's main goal is therefore to perform well with less domain-specific data and without any fine-tuning.

Previous approaches:
Multi-task learning is a very promising framework for improving generalization. – (Caruana, 1997)
First path:
Modest performance improvements – Yogatama et al., 2019
More ambitious efforts have trained on 10 and 17 (dataset, objective) pairs respectively.
However, collecting hundreds to thousands of such datasets and specifying their objectives is still difficult, so additional setups for multi-task learning need to be explored.

Second path:
Pre-training + supervised fine-tuning
word vectors – Mikolov et al., 2013; Collobert et al., 2011
RNNs – Dai & Le, 2015; Peters et al., 2018
self-attention with task-specific architectures – Radford et al., 2018; Devlin et al., 2018

Path 2.5:
Zero-shot: specific tasks can be performed with very little supervised data, e.g. commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).

Third path: the paper's finding
Downstream tasks can be performed in a zero-shot setting, without any supervised learning.

Approach

Core – the language model
A language model is an unsupervised estimate of a probability distribution built from a set of example texts. Because language is sequential, the joint probability distribution can be factorized into a product of conditional distributions. – Jelinek & Mercer, 1980; Bengio et al., 2003
The most significant recent improvement in computing these conditional probabilities is the self-attention Transformer (Vaswani et al., 2017).

Models:
Single-task model: p(output | input)
Multi-task model: p(output | input, task)

Formula 1 (the factorization of the joint sequence probability into conditionals):
p(x) = p(s_1) · p(s_2 | s_1) · … · p(s_n | s_1, …, s_{n-1}) = ∏_{i=1}^{n} p(s_i | s_1, …, s_{i-1})
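
To make the factorization concrete, here is a minimal Python sketch (my own, not from the paper) that scores a token sequence with the chain rule. `cond_prob` is a hypothetical stand-in for whatever model (n-gram, RNN, or Transformer) supplies p(s_i | s_1, …, s_{i-1}).

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Score a sequence with the chain rule:
    log p(x) = sum_i log p(s_i | s_1 .. s_{i-1})."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tok, tokens[:i]))  # p(s_i | prefix)
    return total

# Toy conditional model: a smoothed unigram stand-in for a real LM.
VOCAB = {"the": 0.4, "cat": 0.3, "sat": 0.3}
def toy_cond_prob(token, prefix):
    # A real LM (e.g. a Transformer) would actually condition on the prefix.
    return VOCAB.get(token, 1e-6)

print(sequence_log_prob(["the", "cat", "sat"], toy_cond_prob))  # ≈ -3.32
```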

Task conditioning can be implemented at different levels:

architectural level: task-specific encoders and decoders, as in (Kaiser et al., 2017)

algorithmic level: the inner and outer loop optimization framework of MAML (Finn et al., 2017)

sequence of symbols: the task, its input, and its output are all described as a flexible sequence of symbols, McCann et al. 2018 (see the sketch below)
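
To illustrate the sequence-of-symbols idea, the paper's examples — (translate to french, english text, french text) and (answer the question, document, question, answer) — can be flattened into a single text sequence a language model can be trained on or prompted with. The helper below is hypothetical, not code from the paper.

```python
def as_task_sequence(task, *fields):
    """Encode (task, input, ..., output) as one flat symbol sequence."""
    return " ".join((task,) + fields)

# Translation example: (translate to french, english text, french text)
print(as_task_sequence("translate to french", "the cat sat.", "le chat s'est assis."))

# Reading comprehension: (answer the question, document, question, answer)
print(as_task_sequence("answer the question", "GPT-2 was trained on WebText.",
                       "What was GPT-2 trained on?", "WebText"))
```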

A note from preliminary experiments: learning in this purely language-modeling setup works but is much slower than with explicit supervision, even though the supervised objective is just the unsupervised objective evaluated on a subsequence, so the global minimum of the unsupervised objective is also a global minimum of the supervised one.
The question of interest is whether a language model can learn to infer and perform tasks from natural language alone.

Training set (batch size of 512):

A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl.
WebText: 45M scraped links; after removing Wikipedia pages, de-duplicating, and cleaning, roughly 8M documents and 40 GB of text remain.

Input representation:

– word-level: Al-Rfou et al., 2018
– byte-level: no clear advantage; per Gillick et al. (2015), current byte-level LMs are not competitive
– Byte Pair Encoding (BPE) (Sennrich et al., 2015): a middle ground between word-level and byte-level.
Byte-level BPE can assign a probability to any Unicode string, so the model can be evaluated on any dataset regardless of pre-processing, tokenization, or vocab size. A minimal sketch of the BPE merge loop follows.
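
For intuition, here is a minimal sketch of the classic Sennrich-style BPE merge loop on a toy character-level vocabulary. It is not GPT-2's byte-level implementation; the corpus and the number of merges are invented for illustration.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a {word-as-tuple-of-symbols: freq} vocab."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word frequencies, each word split into characters.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(5):                                   # learn 5 merges
    best = pair_counts(vocab).most_common(1)[0][0]   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```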

Model: four sizes with successively (roughly) doubled parameter counts

– Transformer architecture – Vaswani et al., 2017
– GPT model – Radford et al., 2018
– Two layer-normalization changes:
layer normalization (Ba et al., 2016) is moved to the input of each self-attention sub-block (pre-norm)
an additional layer normalization is added after the final self-attention block
– Parameter adjustments:
the initial weights of residual layers are scaled by 1/√N, where N is the number of residual layers
the vocabulary is expanded to 50,257
the context size is increased from 512 to 1024 tokens
A minimal sketch of a pre-norm Transformer block is shown below.
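
The following is a minimal PyTorch sketch of a pre-norm decoder block of the kind described above. It is my own simplification, not the GPT-2 code: the hidden size and head count default to the smallest model's values, and details such as dropout and the residual weight scaling are omitted.

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """A GPT-2-style decoder block: layer norm is applied to the *input* of each
    sub-block (pre-norm), with residual connections around attention and MLP."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                     # residual around attention
        x = x + self.mlp(self.ln2(x))        # residual around MLP
        return x

x = torch.randn(2, 16, 768)                  # (batch, tokens, d_model)
print(PreNormTransformerBlock()(x).shape)    # torch.Size([2, 16, 768])
```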

Experimental Results

Evaluation metrics:
Lower BPC and PPL are better; higher ACC is better.
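
For orientation, BPC and PPL are both simple transforms of the model's average cross-entropy; the small sketch below (mine, not from the paper) shows the relationship. Note that BPC proper normalizes by characters rather than tokens.

```python
import math

def metrics_from_cross_entropy(ce_nats_per_token):
    """PPL = exp(cross-entropy in nats); bits per token = cross-entropy / ln 2."""
    ppl = math.exp(ce_nats_per_token)
    bits_per_token = ce_nats_per_token / math.log(2)
    return ppl, bits_per_token

print(metrics_from_cross_entropy(3.0))  # (≈20.09 PPL, ≈4.33 bits per token)
```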

Result overview:

  • WebText LMs transfer well: they improve the SOTA on 7 out of 8 test sets in a zero-shot setting.
  • Large improvements on small datasets (1-2M training tokens) such as Penn Treebank and WikiText-2.
  • Large improvements on datasets that test long-range dependencies, such as LAMBADA (Paperno et al., 2016) and the Children’s Book Test (Hill et al., 2015).
  • Clear underperformance on the One Billion Word Benchmark (Chelba et al., 2013), likely because it is a very large dataset with highly destructive pre-processing.

Experiment Figure 1: per-task results

Children’s Book Test
– Measures LM performance on different categories of words.
GPT-2 sets a new SOTA: 93.3% on common nouns and 89.1% on named entities.

LAMBADA

Tests an LM’s ability to model long-range dependencies.
GPT-2 improves the SOTA perplexity from 99.8 (Grave et al., 2016) to 8.6 and accuracy from 19% (Dehghani et al., 2018) to 52.66%.

Winograd Schema Challenge
Measures commonsense reasoning by resolving ambiguities in text.
GPT-2 improves the SOTA accuracy by 7%, achieving 70.70%.

Reading Comprehension
Answering questions that depend on conversation history.
GPT-2 outperforms 3 out of 4 baseline systems.

Summarization
Generating summaries on the CNN and Daily Mail dataset (Nallapati et al., 2016).
GPT-2 lags behind by 6.4 points.

Translation
Tests whether the LM has learned to translate from one language to another.
On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, slightly worse than a word-by-word substitution baseline.

Question Answering
The ability to generate correct answers to factual questions, measured on the Natural Questions dataset (Kwiatkowski et al., 2019).
GPT-2 answers 4.1% of questions correctly under exact-match scoring.
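
All of these tasks are evaluated zero-shot by phrasing them as language-modeling prompts. The sketch below illustrates, under my reading of the paper, two of the prompt formats it describes: a "TL;DR:" hint for summarization and "english sentence = french sentence" example pairs for translation. The helper functions themselves are hypothetical.

```python
def summarization_prompt(article):
    """Zero-shot summarization: append the 'TL;DR:' hint and let the LM continue."""
    return article + "\nTL;DR:"

def translation_prompt(example_pairs, english_sentence):
    """Zero-shot translation: condition on 'english = french' example pairs,
    then prompt with 'english sentence =' and let the LM continue in French."""
    context = "\n".join(f"{en} = {fr}" for en, fr in example_pairs)
    return f"{context}\n{english_sentence} ="

pairs = [("good morning", "bonjour"), ("thank you", "merci")]
print(translation_prompt(pairs, "see you tomorrow"))
```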

Discussion and Conclusions

Generalization vs. memorization

Problem: overlap between training and test data inflates reported performance; ImageNet numbers, for example, are biased upward in this way. It is therefore necessary to measure how much the test data overlaps with the training data and to keep that ratio low.

Method: Bloom filters built from 8-grams of WebText training tokens are used to detect near-duplicate content.

Result: the test sets overlap with the WebText training set by 1-6%, with an average overlap of 3.2%.
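
The sketch below shows the overlap check in its simplest form, using an ordinary Python set where the paper uses Bloom filters for memory efficiency; the toy data and the smaller n are only for illustration.

```python
def ngrams(tokens, n=8):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(test_tokens, train_tokens, n=8):
    """Fraction of test n-grams that also occur in the training data."""
    train_set = ngrams(train_tokens, n)   # stand-in for a Bloom filter
    test_set = ngrams(test_tokens, n)
    return len(test_set & train_set) / len(test_set) if test_set else 0.0

train = "the quick brown fox jumps over the lazy dog and runs away fast".split()
test = "a quick brown fox jumps over the lazy dog today".split()
print(f"{overlap_rate(test, train, n=5):.2%}")  # 66.67% on this toy example
```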

Related Work, Follow-up, and Conclusions

Related work:

Much prior work has measured performance by training larger models on larger datasets; Jozefowicz et al. (2016) scaled RNN-based language models on the One Billion Word Benchmark.

Discussion:

  • The results indicate that unsupervised task learning is a promising research direction with substantial room for exploration.
  • GPT-2 sets a strong zero-shot baseline on many tasks, but its potential with fine-tuning has not been fully explored and needs further study.
  • Building on the earlier success of fine-tuning GPT, a natural follow-up is whether the additional training data and capacity of GPT-2 can make up for the inefficiency of unidirectional representations.

Conclusions:

  1. Large language models trained on sufficiently large and diverse datasets can perform well across domains and datasets.
  2. In a zero-shot setting, GPT-2 improves the SOTA on 7 out of 8 test datasets.
  3. A high-capacity model trained on a sufficiently large and diverse text corpus can handle a large number of tasks without any explicit supervision.

My comments

GPT-2 was essentially a transitional product on the way to GPT-3 and did not make much of a stir on its own: roughly a year later, the heavyweight GPT-3 would be released.