Abstract
Problem: zero-shot learning
Large language models such as GPT-3 excel at few-shot learning but perform much worse zero-shot, especially on tasks like reading comprehension, question answering, and natural language inference. One likely reason is that, without few-shot exemplars, the model has nothing to anchor its behavior on when facing an unfamiliar task; another is that zero-shot prompts are formatted very differently from the model's pre-training data.
To improve zero-shot performance, a simple method can be explored: make the model better suited to zero-shot tasks, for example by phrasing input prompts so that they better match the knowledge acquired during pre-training. The model can then understand and execute an unseen task even without few-shot exemplars.
Method: instruction tuning
The intuition is that NLP tasks can be described via natural language instructions. We take a pre-trained model (about 137B parameters) and fine-tune it on more than 60 NLP datasets, each task expressed as natural language instructions. The datasets are grouped into clusters by task type; when evaluating on one cluster, only the other clusters are used as training data. The instruction-tuned model is named FLAN, for Finetuned Language Net.
Natural language instruction templates
Result: substantially improved zero-shot performance on unseen tasks
Substantially outperforms the 137B-parameter pre-trained model.
Outperforms zero-shot 175B-parameter GPT-3 on 20 of 25 datasets.
Surpasses even few-shot GPT-3 on several datasets (ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze).
Ablations show that the keys to successful instruction tuning are the number of fine-tuning datasets, model scale, and natural language instructions.
Instruction tuning is a simple method that combines the strengths of the pretrain–finetune and prompting paradigms:
Pretrain–finetune: one model per task, requiring task-specific examples.
Prompting: prepend few-shot exemplars to the input.
GPT-1: unsupervised pre-training → fine-tuning on each downstream task.
GPT-2: fold the downstream task into pre-training as a condition, P(output | input, task).
GPT-3: pre-training → prompt with examples of the downstream task (few-shot) or without them (zero-shot); no fine-tuning needed.
Instruction tuning: pre-training → instruction tuning on many tasks → inference on unseen tasks.
The sketch below contrasts these prompt styles.
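The three paradigms differ mainly in how the model's input is constructed. A hedged illustration follows; the zero-shot and few-shot prompt strings are invented for this sketch, while the instruction string follows the paper's running translation example:

```python
# GPT-3 zero-shot: the task input alone, phrased as text to be continued.
zero_shot = "Translate English to French:\nThe dog runs. =>"

# GPT-3 few-shot: a handful of solved examples precede the query.
few_shot = (
    "Translate English to French:\n"
    "The cat sleeps. => Le chat dort.\n"
    "The dog runs. =>"
)

# FLAN-style instruction: a natural-language task description, no exemplars.
instruction = "Please translate this sentence to French: 'The dog runs.'"
```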
Approach:
FLAN: instruction tuning improves zero-shot learning.
Idea: improve the model's ability to respond to natural language instructions.
Use supervised learning with instruction descriptions to teach the language model to perform tasks; a model that learns to follow instructions can then be applied to unseen tasks.
Datasets are grouped into clusters by task type. One cluster is held out for evaluation while the remaining clusters are used for fine-tuning.
Tasks and templates
Aggregate 62 text datasets from Tensorflow Datasets, covering both language understanding and generation tasks, into a single collection. For each dataset, manually write 10 templates that describe the task in natural language. To increase diversity, up to three of the templates "turn the task around": for a sentiment classification dataset, for example, one template asks the model to generate a fictional movie review instead of classifying one. The model is then fine-tuned on the union of all datasets, with each example formatted by an instruction template for its dataset.
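A minimal sketch of how such templates might be instantiated; the field names and template wording here are illustrative, not the paper's exact templates:

```python
import random

templates = [
    "Is the sentiment of this movie review positive or negative?\n{review}",
    "{review}\nDid the reviewer like the movie?",
    # A template that "turns the task around": generate instead of classify.
    "Write a {label} movie review.",
]

def render(example: dict) -> str:
    """Format one example with a randomly chosen template for its dataset."""
    return random.choice(templates).format(**example)

example = {"review": "A quietly devastating film.", "label": "positive"}
print(render(example))
```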
Evaluation splits
We need to define what counts as an unseen task. Previous work split by dataset: any dataset absent from the training set counted as unseen. We use a stricter definition and split by task cluster (Figure 3): a dataset counts as unseen only if its entire task cluster was excluded from fine-tuning.
To evaluate zero-shot performance on c task clusters, we instruction-tune c separate models, each evaluated on its own held-out cluster. Note: the commonsense-reading-comprehension cluster is mutually exclusive with the commonsense and reading-comprehension clusters, and natural language inference (NLI) is mutually exclusive with paraphrase.
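A hedged sketch of this leave-one-cluster-out protocol; `instruction_tune` and `evaluate` are placeholders standing in for the real training and evaluation pipelines, and the cluster contents below are abbreviated:

```python
clusters = {
    "nli": ["anli_r1", "rte", "cb"],
    "reading_comprehension": ["boolq", "squad"],
    "translation": ["wmt14_enfr", "wmt16_ende"],
    # ... the paper groups its 62 datasets into twelve clusters
}

def instruction_tune(model, datasets):
    """Placeholder: fine-tune `model` on instruction-formatted `datasets`."""
    return model

def evaluate(model, dataset):
    """Placeholder: return a zero-shot metric on `dataset`."""
    return 0.0

def leave_one_cluster_out(pretrained, clusters):
    scores = {}
    for held_out, eval_sets in clusters.items():
        train_sets = [d for name, ds in clusters.items()
                      if name != held_out for d in ds]
        model = instruction_tune(pretrained, train_sets)
        scores[held_out] = {d: evaluate(model, d) for d in eval_sets}
    return scores
```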
To constrain the output space of classification tasks, we use an options suffix: the token OPTIONS is appended to the end of the input, followed by the list of output classes for that task (Figure 1).
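For illustration, an options suffix on a classification input might look like the following (the wording is a sketch; Figure 1 of the paper shows the actual format):

```python
prompt = (
    "Premise: The dog runs.\n"
    "Hypothesis: An animal is moving.\n"
    "Does the premise entail the hypothesis?\n"
    "OPTIONS:\n"
    "- yes\n"
    "- no"
)
```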
Training details: model architecture and pre-training
Model: LaMDA-PT, a dense left-to-right, decoder-only transformer language model with 137B parameters (Thoppilan et al., 2022).
Pre-training data: web documents (including code), dialog data, and Wikipedia.
The text is tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018).
BPE (Byte-Pair Encoding) works as follows:
1. Tokenize the corpus into words and count frequencies:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
2. Split each word into characters and deduplicate them to form the initial vocabulary:
["b", "g", "h", "n", "p", "s", "u"]
3. Count adjacent symbol pairs, merge the most frequent pair into a new symbol, and add it to the vocabulary. Here "u" and "g" merge into "ug":
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
4. Repeat the previous step until the vocabulary reaches a target size or the most frequent pair occurs only once.
After three rounds of merging, the vocabulary is: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
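A minimal BPE trainer reproducing the walkthrough above (a sketch, not SentencePiece's actual algorithm, which also handles whitespace and byte fallback):

```python
from collections import Counter

def bpe_merges(word_freqs: dict, num_merges: int) -> list:
    """Repeatedly merge the most frequent adjacent symbol pair."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for word in word_freqs for ch in word})
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs or pairs.most_common(1)[0][1] <= 1:
            break  # stop when the most frequent pair occurs only once
        (a, b), _ = pairs.most_common(1)[0]
        vocab.append(a + b)
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return vocab

print(bpe_merges({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, 3))
# ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
```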
Training details: instruction-tuning procedure, starting from LaMDA-PT
All datasets are mixed together and sampled randomly to balance their sizes. The number of training examples per dataset is capped at 30,000. We use the examples-proportional mixing scheme of Raffel et al. (2020) with a mixing-rate maximum of 3,000: a dataset receives no additional sampling weight for examples beyond its first 3,000.
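A sketch of examples-proportional mixing with a mixing-rate maximum, after Raffel et al. (2020): a dataset's sampling weight grows with its size only up to `max_rate` examples. The dataset names and sizes below are illustrative.

```python
import random

def mixing_weights(dataset_sizes: dict, max_rate: int = 3000) -> dict:
    capped = {name: min(size, max_rate) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

sizes = {"anli_r1": 30000, "copa": 400, "wmt14_enfr": 30000}  # after the 30k cap
weights = mixing_weights(sizes)
# Draw the dataset to sample the next training example from.
name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```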
Gradient steps: 30k
Batch size: 8,192 tokens
Optimizer: Adafactor (Shazeer & Stern, 2018)
Learning rate: 3e-5
Input sequence length: 1024
Target sequence length: 256
Packing (Raffel et al., 2020) is used to combine multiple training examples into a single sequence, with a special EOS token separating inputs from targets (see the sketch after this list)
Hardware: TPUv3, 128 cores
Fine-tuning time: about 60 hours
The checkpoint used for evaluation is the final one, after 30k steps
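A sketch of packing: concatenate tokenized examples into sequences of at most `max_len` tokens. The token ids are illustrative; the trailing EOS after each target is an assumption of this sketch (the paper only states that EOS separates inputs from targets), and in practice an attention mask also keeps tokens from attending across the boundaries between packed examples.

```python
EOS = 1  # illustrative id for the special EOS token

def pack(examples, max_len=1024, eos=EOS):
    """examples: list of (input_ids, target_ids) pairs -> packed id lists."""
    packed, current = [], []
    for inp, tgt in examples:
        segment = inp + [eos] + tgt + [eos]
        if current and len(current) + len(segment) > max_len:
            packed.append(current)
            current = []
        current = current + segment
    if current:
        packed.append(current)
    return packed

print(pack([([5, 6], [7]), ([8], [9, 10])], max_len=8))
# [[5, 6, 1, 7, 1], [8, 1, 9, 10, 1]]
```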
Results:
We evaluate on each task cluster using a model fine-tuned on the other clusters, and for each dataset report the mean performance across templates. We also perform prompt engineering, manually tuning prompts on a dev set as in previous work (Brown et al., 2020), and report test performance with the best dev-set template.
We compare FLAN against its un-instruction-tuned counterpart as well as zero-shot and few-shot baselines, where the base model uses the same prompts as GPT-3. On most datasets, instruction tuning significantly improves performance.
We also compare against zero-shot GPT-3 175B (Brown et al., 2020) and GLaM 64B/64E (Du et al., 2021), using the numbers reported in those papers. With the best dev-set template, FLAN outperforms zero-shot GPT-3 on 20 of 25 datasets and few-shot GPT-3 on 10 datasets. Across 19 datasets, FLAN outperforms zero-shot GLaM on 13 and one-shot GLaM on 11.
In summary, instruction tuning is very effective on tasks that are naturally described as instructions, e.g., natural language inference (NLI), question answering (QA), translation, and struct-to-text. It is less effective on tasks that are directly formulated as language modeling, where the instruction is largely redundant, e.g., commonsense reasoning and coreference resolution tasks formatted as completing an incomplete sentence or paragraph.
Natural language inference (NLI): five datasets where the model decides whether a hypothesis is true given a premise. FLAN outperforms all baselines by a large margin. Following Brown et al. (2020), one likely reason GPT-3 struggles with NLI is that NLI examples rarely occur naturally in an unsupervised training corpus, so they are awkward to phrase as the continuation of a sentence. FLAN instead phrases NLI as the more natural question "Does <premise> mean that <hypothesis>?", which yields much higher performance.
Reading comprehension: the model answers questions about a passage.
Closed-book QA: the model answers questions without access to a text containing the answer.
Translation: as with GPT-3, about 90% of the pre-training corpus is English, with text in other languages not specifically intended for training translation. Translation into English works better, both because the tokenizer is oriented toward English and because most of the pre-training data is English.
Other tasks: a limitation of instruction tuning is that it does not improve performance on many language modeling tasks, e.g., commonsense reasoning and coreference resolution tasks formulated as sentence completion (Appendix Table 2), where FLAN beats LaMDA-PT on only three. The implication: when the downstream task is identical to the pre-training objective, so that the instruction is largely redundant, instruction tuning does not help.
Ablation studies
Number of instruction-tuning task clusters
Three clusters (NLI, closed-book QA, and commonsense reasoning) are held out for evaluation; the other seven clusters are used for fine-tuning. Note: paraphrase and commonsense reading comprehension are excluded because they are too similar to NLI and commonsense reasoning.
Figure 6 shows the results, with the x-axis giving the number of clusters used for training (1 to 7) and the y-axis the performance on the three held-out clusters. Adding more clusters to training improves evaluation performance, with the exception of the sentiment-analysis cluster. The trend has not saturated at seven clusters, so performance might improve further with more.
Note: this experiment cannot establish which cluster contributes the most (even though sentiment analysis adds the least).
Scaling laws
Brown et al. (2020) showed that scale improves a language model's zero-shot and few-shot abilities; here we study how model scale affects instruction tuning. Using the same task split as Section 4.1, we evaluate models of 422M, 2B, 8B, 68B, and 137B parameters (Figure 7).
For the two models near 100B parameters, instruction tuning substantially improves performance. For models of 8B or smaller, instruction tuning actually hurts. A possible explanation: learning the roughly 40 training tasks fills up a small model's capacity, making it worse on new tasks. On this account, instruction tuning also consumes some of a large model's capacity, but it teaches the model how to follow instructions, letting the remaining capacity generalize to new tasks.
Role of instructions
We rule out the possibility that the gains come from multi-task fine-tuning alone, without explicit instructions. Two instruction-free fine-tuning setups serve as comparisons:
No template: the task is given only as an input/output pair. For translation, the input is "The dog runs" and the target is "Le chien court".
Dataset name: the task is identified only by its dataset name, e.g., the input is "[Translation: WMT'14 to French] The dog runs."
Both are compared against FLAN's fine-tuning, which uses natural language instructions such as "Please translate this sentence to French: 'The dog runs.'".
We evaluate on four held-out clusters: natural language inference, reading comprehension, closed-book QA, and translation. For the no-template setup, FLAN's instructions are still used at zero-shot inference time, since the model otherwise has no way to know which task to perform; for the dataset-name setup we evaluate with both the task name and FLAN's instructions. Both ablation configurations perform far worse than FLAN, showing that instructions are essential for zero-shot generalization to unseen tasks. The sketch below illustrates the three input formats.
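The three fine-tuning formats written as formatting functions; the example strings come from this section's running example, while the helper names are mine:

```python
def no_template(src: str, tgt: str):
    # Input/output pair only, no task description.
    return src, tgt

def dataset_name(src: str, tgt: str):
    # Task identified only by the name of its dataset.
    return f"[Translation: WMT'14 to French] {src}", tgt

def flan_instruction(src: str, tgt: str):
    # Natural-language instruction, as used by FLAN.
    return f"Please translate this sentence to French: '{src}'", tgt

print(flan_instruction("The dog runs.", "Le chien court."))
```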
Related work
The research touches many areas: zero-shot learning, prompting, multi-task learning, and language models for NLP applications (Appendix D). Here we focus on the two most closely related subfields.
Asking the model to respond to instructions resembles QA-based task formulation (Kumar et al., 2016; McCann et al., 2018), which unifies NLP tasks by casting them as question answering over a context. Although very similar, these works focus on multi-task learning rather than zero-shot learning, and they make little use of the knowledge already present in pre-trained models (Liu et al., 2021). Our work exceeds recent efforts such as Chai et al. (2020) and Zhong et al. (2021) in both model scale and task scope.
The success of language models has sparked early research on models' ability to follow instructions. Recently, Mishra et al. (2021) fine-tuned a 140M-parameter BART on instructions with few-shot exemplars and evaluated its few-shot ability on unseen tasks, similar to our few-shot instruction-tuning experiment in Section 4.4. Ye et al. (2021) is similar but places less emphasis on instructions. These results suggest that fine-tuning on many tasks improves few-shot performance on unseen tasks, even for small models.
Sanh et al. (2021) fine-tune T5 in a setup similar to ours and find that zero-shot learning can improve in a model of 11B parameters. OpenAI's InstructGPT is similar to our model in scale and is trained via fine-tuning plus reinforcement learning to produce outputs preferred by human raters (Ouyang et al., 2022).
Discussion
The paper asks a simple question about zero-shot prompting: does fine-tuning a model on tasks phrased as instructions improve its performance on unseen tasks? It answers with instruction tuning, which combines the strengths of the pretrain–finetune and prompting paradigms. FLAN improves performance over its un-instruction-tuned counterpart and beats zero-shot GPT-3 on the majority of tasks.
Ablations show that performance on unseen tasks grows with the number of instruction-tuning task clusters, and that the gains of instruction tuning emerge only at sufficient model scale. Instruction tuning can also be combined with other prompting methods, such as few-shot prompting and prompt tuning.
The breadth of capabilities in large language models has drawn attention to the tradeoff between specialist and generalist models, and our study has potential implications for both. One might expect labeled data to matter mainly for improving specialist models; instruction tuning shows how labeled data can help a large model perform many unseen tasks. Its improvement to cross-task generalization indicates that task-specific training complements general language modeling, motivating further research on generalist models.
Our study has some limitations:
Grouping tasks into clusters involves some subjectivity, even though we followed commonly accepted categorizations in the literature.
We only studied short instructions, typically a single sentence.
Individual evaluation examples may have appeared in the pre-training data, which includes web documents, although a post-hoc analysis suggests this did not materially affect the results.
At 137B parameters, FLAN is costly to serve.
Future work
Collect or generate more task clusters for fine-tuning; cross-lingual experiments.
Use FLAN to generate data for training downstream classifiers.
Use fine-tuning to improve model behavior with respect to bias and fairness (Solaiman & Dennison, 2021).
Closing remarks
Summary of the paper
Ethics
Labeled datasets may contain undesirable biases, which can propagate into downstream zero-shot applications.
Instruction tuning may reduce the need for data and domain expertise; lowering the barrier to entry brings both benefits and risks.
Environmental impact
Pre-training used the same model as Austin et al. (2021), consuming 451 MWh of energy with a carbon footprint of 26 tCO2e.
Instruction tuning adds fewer than 2% of the pre-training steps, so the extra energy cost is small.
Appendix
Appendix B: additional ablation studies
Holding the number of task clusters constant, we vary the number of datasets per cluster (1 to 4) and the number of templates per dataset (1, 4, or 10). Adding datasets gives a clear improvement; the number of templates has little visible effect. The latter is a blow to the original motivation for templates (the hope that 10 templates would generalize to arbitrary phrasings).
A conjecture: large-scale models do not generalize easily within a specific task.
Appendix C: data contamination analysis
The pre-training corpus has more than 2T tokens, raising the concern that it contains parts of the evaluation datasets, which would bias zero-shot evaluation. As with GPT-3, a post-hoc contamination analysis is run.
Following Brown et al. (2020), a "clean" version of each evaluation set is constructed by computing n-gram overlap and removing examples that may have appeared in the pre-training corpus. The performance change after cleaning (y-axis) is then plotted against each dataset's clean-example ratio (x-axis).
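A hedged sketch of such a check, in the spirit of Brown et al. (2020), who flagged evaluation examples sharing long n-grams (13-grams in GPT-3's analysis) with the pre-training corpus; the helper names and the simple set intersection are simplifications of mine:

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def clean_split(eval_examples: List[List[str]],
                pretrain_ngrams: Set[Tuple[str, ...]],
                n: int = 13):
    """Drop examples overlapping the pre-training corpus; report clean ratio."""
    clean = [ex for ex in eval_examples if not (ngrams(ex, n) & pretrain_ngrams)]
    ratio = len(clean) / max(len(eval_examples), 1)
    return clean, ratio
```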
My take
This paper is only an interlude in the evolution of large models; it did not make big waves on its own.