Background
Deficiencies of current models
Overreliance on in-domain labeled data:
The two-stage pre-training + fine-tuning framework needs domain-specific labeled data; without it, good results are hard to obtain, and such annotation is expensive to acquire.
Overfitting to the domain data distribution:
During fine-tuning the model fits the training data distribution; when data is limited this leads to overfitting, reduced generalization, and restricted use in other domains.
The case for removing fine-tuning:
- New tasks require large amounts of labeled data, which hinders the application of language models.
- Good fine-tuning performance does not necessarily mean the pre-trained model generalizes well; it could, for example, reflect overfitting to the pre-training data.
- Humans do not need many samples to handle a new downstream language task; a one-sentence description of the task or a few examples suffice.
- What current NLP technology lacks is the ability to seamlessly blend and switch between multiple tasks.
Approaches to removing fine-tuning
Meta-learning
Its results are considerably worse than fine-tuning.
Large-scale Transformers
Transformer models have kept growing in parameter count, and the added parameters bring significant performance gains. Papers have reported that loss keeps decreasing as models grow larger.
The authors conjectured that a still larger Transformer should have stronger learning ability, so they trained a 175-billion-parameter autoregressive language model, GPT-3, to test this hypothesis (same principles and architecture as GPT-2, with more parameters and with fine-tuning removed).
Approach
Model architecture
The architecture is similar to GPT-2's, including the model initialization, normalization, and input encoding. It differs from GPT-2 in using alternating dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
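To make the "alternating dense and locally sparse attention" idea concrete, here is a minimal sketch of what such causal masks could look like; the window size and the even/odd alternation are illustrative assumptions, not the released implementation.

```python
import numpy as np

def causal_attention_mask(seq_len: int, layer_idx: int, window: int = 4) -> np.ndarray:
    """Illustrative only: even layers use dense causal attention, odd layers use a
    locally banded causal pattern in the spirit of the Sparse Transformer."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    dense_causal = j <= i             # each token attends to itself and the past
    if layer_idx % 2 == 0:
        return dense_causal
    return dense_causal & (i - j < window)  # restrict to a local window of recent tokens
```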
Model parameters:
A total of 8 model sizes were trained to study the effect of scale, and to support the claim that larger models perform better. The listed hyperparameters are:
- n_params: number of model parameters
- n_layers: total number of layers
- d_model: hidden-layer dimensionality
- n_heads: number of attention heads
- d_head: dimensionality of each attention head
The parameter counts range from 125 million up to 175 billion.
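A quick back-of-the-envelope check of the 175B figure, assuming the usual Transformer block cost of roughly 12 * n_layers * d_model^2 parameters and GPT-3's reported 96 layers with d_model = 12288 (embeddings and biases ignored):

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """~4*d_model^2 per attention block plus ~8*d_model^2 per 4x-wide MLP block."""
    return 12 * n_layers * d_model ** 2

print(approx_transformer_params(96, 12288) / 1e9)  # ~174 (billion), close to the quoted 175B
```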
Training data
Raw size is 45 TB; 570 GB remains after filtering.
Training data filtering steps:
- Filter Common Crawl using a high-quality reference corpus.
- Fuzzy deduplication at the document level, both within and across datasets.
- Add known high-quality corpora to the training mix.
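As a sketch of document-level fuzzy deduplication, one common approach is MinHash-based locality-sensitive hashing; the datasketch usage below is illustrative and not necessarily the exact tooling used for GPT-3.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(doc: str, num_perm: int = 128) -> MinHash:
    """Hash the document's lowercased tokens into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for token in doc.lower().split():
        m.update(token.encode("utf-8"))
    return m

# Keep a document only if no previously kept document is near-identical.
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # the 0.8 Jaccard threshold is an assumption
kept = []
for doc_id, doc in enumerate(["some web page text ...", "some web page text ..."]):
    sig = minhash_of(doc)
    if not lsh.query(sig):            # no near-duplicate already stored
        lsh.insert(str(doc_id), sig)
        kept.append(doc)
```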
Filtering Common Crawl:
- Classifier: Spark, with tokenization and HashingTF features.
- Model: logistic regression (LR).
- Positive class: WebText. Negative class: unfiltered Common Crawl.
- A threshold is set and documents below it are filtered out.
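A minimal sketch of the quality classifier described above (Spark tokenization + HashingTF features + logistic regression); the column names, toy data, and threshold rule are illustrative assumptions, not the original pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cc-quality-filter").getOrCreate()

# label 1.0 = high-quality reference text (WebText-like), 0.0 = raw Common Crawl
train_df = spark.createDataFrame(
    [("a well written reference paragraph ...", 1.0),
     ("menu login cookie banner click here ...", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)

# Score documents and keep those the classifier judges to be high quality.
scored = model.transform(train_df.select("text"))
kept = scored.filter(scored.prediction == 1.0)
```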
Known issues:
- Overlap between the training and test sets was removed, but 100% removal was not achieved; some overlap remains.
- Training is very costly, so retraining the model is infeasible.
Training process
Hyperparameters:
Optimizer: Adam
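As a sketch only: configuring an Adam optimizer in PyTorch. The learning rate and betas below are placeholders for illustration; GPT-3's actual training hyperparameters are given in the paper's appendix.

```python
import torch

model = torch.nn.Linear(768, 768)   # stand-in module
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # illustrative value only
    betas=(0.9, 0.95),  # illustrative value only
    eps=1e-8,
)
```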
Model evaluation
Evaluation metrics
Different tasks use different evaluation standards and datasets.
LAMBADA and StoryCloze: there is no supervised training set available, so conditioning examples are drawn from the development set and evaluation is done on the test set.
Multiple-choice tasks: K examples of context plus the correct completion are provided, followed by one example with context only, and the LM likelihood of each candidate completion is compared.
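To make the multiple-choice scoring rule concrete, here is a rough sketch that scores each candidate completion by its token log-likelihood under a causal LM and picks the highest-scoring one; GPT-2 from Hugging Face transformers is used as a stand-in, since GPT-3's weights are not public.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the context."""
    ctx_ids = tokenizer.encode(context)
    full_ids = tokenizer.encode(context + completion)  # boundary is approximate if BPE merges across it
    input_ids = torch.tensor([full_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # the token at position i is predicted from the logits at position i - 1
    return sum(log_probs[0, i - 1, full_ids[i]].item()
               for i in range(len(ctx_ids), len(full_ids)))

def pick_answer(context: str, choices: list[str]) -> str:
    """Multiple choice: return the completion the LM finds most likely."""
    return max(choices, key=lambda c: completion_logprob(context, c))
```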
Evaluation settings
Three settings are used: zero-shot, one-shot, and few-shot.
Comparison of GPT-1 vs. GPT-2 vs. GPT-3
Experimental results
GPT-3's performance is evaluated under three conditions.
Few-shot (FS)
A small number of examples, typically 10-100, are given to the model; no weights are updated.
Definition: several demonstrations plus a task description are allowed as input.
Example: feed the model "Translate Chinese to English. 你好 -> hello, 再见 -> goodbye, 购买 -> purchase, 销售 ->" and ask it to predict the next output; the correct answer is "sell". (A prompt-construction sketch covering all three settings follows the zero-shot setting below.)
One-shot (1S)
Exactly one example is provided; no weights are updated.
Definition: only a single demonstration plus a task description is allowed as input.
Example: feed the model "Translate Chinese to English. 你好 -> hello, 销售 ->" and ask it to predict the next output; the correct answer is "sell".
Zero-shot (0S)
Closest to how humans work: only a natural-language description of the task, no examples, and no weight updates.
Definition: no demonstrations are allowed, only a task description.
Example: feed the model "Translate Chinese to English. 销售 ->" with no demonstrations and ask it to predict the next output; the correct answer is "sell".
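The sketch below (an illustrative helper of mine, not code from the paper) shows how the three settings differ only in how many demonstrations are packed into the prompt; no weights are updated in any of them.

```python
def build_prompt(task_description: str, demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate a task description, zero or more demonstrations, and the query."""
    lines = [task_description]
    lines += [f"{src} -> {tgt}" for src, tgt in demos]
    lines.append(f"{query} ->")   # the model is asked to continue from here
    return "\n".join(lines)

demos = [("你好", "hello"), ("再见", "goodbye"), ("购买", "purchase")]
few_shot  = build_prompt("Translate Chinese to English.", demos,     "销售")  # several demos
one_shot  = build_prompt("Translate Chinese to English.", demos[:1], "销售")  # exactly one demo
zero_shot = build_prompt("Translate Chinese to English.", [],        "销售")  # description only
```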
Conclusion
The larger the model, the better the few-shot performance.
The 175B-parameter curve performs best.
Traditional language modeling
Tasks include next-word prediction, sentence and paragraph completion, cloze (fill-in-the-blank) tasks, and more.
PTB: GPT-3's zero-shot performance improves on the state of the art by a full 15 points.
LAMBADA: the task is to read a passage and predict its last word. GPT-3's few-shot accuracy does not surpass the state of the art, but it does outperform the zero-shot and one-shot settings.
The evaluation also compares GPT-3's zero-shot, one-shot, and few-shot performance on LAMBADA across different parameter counts.
Question answering (QA)
Used to test factual knowledge, i.e. direct ask-and-answer.
On TriviaQA, both the one-shot and few-shot settings surpass the fine-tuned state of the art.
Translation tasks
Overall the translation results are good.
Translation ability vs. model parameters: performance improves roughly linearly as the parameter count grows exponentially.
Note: BLEU is the metric used to evaluate machine translation.
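For reference, the standard corpus-level BLEU formula, with modified n-gram precisions p_n, uniform weights w_n = 1/N (typically N = 4), candidate length c, and reference length r:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)
```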
Data contamination: prevention and evaluation
Prevention:
Detecting test-set contamination in internet-scale datasets, especially with training data this large and no ability to retrain the model, is an unsolved challenge with no established best practices.
To address contamination, OpenAI tried to remove the portions of the training data that overlap with the benchmark test sets. Due to a bug, however, only part of the detected overlap was removed, and the high cost of retraining made redoing the training infeasible.
Evaluation:
GPT-2 overlap analysis: the training and test data overlapped to some extent, but contaminated data amounted to only a few percentage points and did not significantly affect the results.
GPT-3's training data is so large that overfitting to the training set is not substantial. Contamination is expected to occur frequently, but its impact may not be as large as feared.
GPT-3 training curves: comparing training loss against validation loss shows a gap, but the gap changes only slightly as model size and training time increase, indicating essentially no overfitting and hence a small effect from data contamination.
Limitations
I: Text synthesis
Although overall quality is high, GPT-3 sometimes repeats itself semantically, starts losing coherence over sufficiently long passages, and contradicts itself.
II: Discrete language tasks
GPT-3 has particular difficulty with "physical commonsense"; for example, it struggles with questions like "If I put cheese into the fridge, will it melt?".
III: Structural and algorithmic limitations
The work focuses on in-context learning with an autoregressive language model, where sampling and computing likelihoods are straightforward; it performs poorly on fill-in-the-blank tasks, tasks that require looking back and comparing two pieces of content, and tasks that require re-reading or carefully considering a long passage and then generating a very short answer.
IV: A limitation of few-shot learning
It is unclear whether few-shot learning actually learns new tasks "from scratch" at inference time.
V: Cost
The model is expensive (on the order of US$15 million) and costly to run inference with.
A possible future direction is to distill large models into small task-specific models.
VI: Interpretability
As with deep learning systems in general, its decisions are not easy to interpret.
Broader impacts
I: Misuse of language models
Examples: disinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing, and social-media disinformation bots.
Security-risk example: the leak of confidential Samsung chip information.
II: Fairness and bias
Bias in the data can lead to bias in the model, which may harm some groups of people.
Gender: occupations following a neutral variable were more likely to be completed as male than female; 83% of the 388 occupations tested were more likely to be associated with a male identifier, while occupations more likely to be associated with females included midwife, nurse, receptionist, and housekeeper.
Race: the study explored how race affects sentiment, using SentiWordNet to score the words that co-occur disproportionately with each race; the sentiment of these words differs across races, with "Asian" co-occurring with more positively scored words.
Religion: the study examined which words co-occur with religious terms such as atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism. For Islam, words such as "violence" and "terrorism" co-occurred with "Islam" at a higher rate than with other religions.
III: Energy
Practical large-scale pre-training requires enormous amounts of computation and is energy-intensive.
Example: with the compute of 10 V100 GPUs, training GPT-3 would take roughly 10 years.
Example: model distillation can reduce the cost of deploying such models.
Related work
I: Scaling up the parameter count and compute of language models
- Directly increasing the size of the Transformer, scaling parameters and compute proportionally.
- Increasing the number of parameters without increasing compute.
- Increasing compute without increasing the number of parameters.
II: Studies of how scale affects language model performance
As autoregressive language models are scaled up, the loss follows a smooth power-law trend.
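The power-law trend referenced above is usually written in the form below (following the scaling-law literature, e.g. Kaplan et al. 2020), where N is the parameter count and N_c and alpha_N are fitted constants:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```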
III: Reducing model memory
Model distillation methods.
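Model distillation, mentioned here and in the limitations/energy sections, typically trains a small student to match a large teacher's softened output distribution; the PyTorch sketch below shows the generic objective and is not anything specific to GPT-3.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between the student's and teacher's temperature-softened distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # scale by t^2 to keep gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```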
IV: New approaches to few-shot learning
Exploring few-shot learning that combines pre-trained language models with gradient descent.
V: OpenAI continues to focus on purely autoregressive language models
Both to concentrate on in-context learning performance and to reduce the complexity of implementing large models.
My comments
This is the model behind the recently wildly popular ChatGPT; the model itself (GPT-3) is currently not open-sourced.