Introduction:
Problem: text-to-image generation
Prior work focused on modeling on fixed datasets, improving the architecture, loss functions, auxiliary information (labels or masks), and so on:
- Mansimov et al. (2015): showed that the DRAW generative model of Gregor et al. (2015) can be conditioned on image captions.
- Reed et al. (2016b): using the generative adversarial networks of Goodfellow et al. (2014) instead of a recurrent variational autoencoder (RVAE) improved image fidelity; the model could not only generate known objects but also generalize zero-shot to other categories.
- Multi-scale generators (Zhang et al., 2017; 2018).
- Integration of attention and auxiliary losses (Xu et al., 2018).
- Leveraging sources of conditioning information beyond text (Reed et al., 2016a; Li et al., 2019; Koh et al., 2021).
- Nguyen et al. (2017): an energy-based framework for image generation that can incorporate pre-trained discriminative models, e.g., an image captioning model pre-trained on MS-COCO.
- Cho et al. (2020): optimizing the inputs to a pre-trained cross-modal masked language model.
Although visual fidelity has improved significantly, some serious issues remain: object distortion, illogical object placement, and unnatural blending of foreground and background.
Progress in large-scale generative models may point to a direction for improvement:
- The autoregressive Transformer (Vaswani et al., 2017, "Attention Is All You Need") has achieved remarkable results in several domains: text (Radford et al., 2019), images (Chen et al., 2020), and audio (Dhariwal et al., 2020).
- In contrast, text-to-image generation is typically evaluated on relatively small datasets such as MS-COCO and CUB-200 (Welinder et al., 2010).
- Could dataset size and model size be the limiting factors for text-to-image generation?
Method:
We train a 12-billion-parameter autoregressive transformer on 250 million image-text pairs collected from the internet, obtaining a flexible, high-fidelity image generation model that can be controlled with natural language.
Results:
Zero-shot evaluation on MS-COCO achieves high-quality image generation: human evaluators preferred its outputs over those of previous models trained on the dataset 90% of the time. The model can also perform complex tasks such as image-to-image translation.
Examples of generated images
Approach:
The goal is to train a transformer (Vaswani et al., 2017) that models text and image tokens as a single stream of data.
- Using every pixel as a token would make high-resolution images take an enormous amount of memory.
- Likelihood objectives tend to prioritize modeling the short-range dependencies between pixels (Salimans et al., 2017), so most of the model capacity would be spent on high-frequency details rather than on the low-frequency structure that makes objects visually recognizable to us.
To address these issues, a two-stage training procedure is used (Oord et al., 2017; Razavi et al., 2019):
- Stage 1: train a discrete variational autoencoder (dVAE)
  - Compress each 256×256 RGB image into a 32×32 grid of tokens, each of which can assume 8192 possible values.
  - This reduces the transformer's context size by a factor of 192 with no large degradation in visual quality.
- Stage 2: train an autoregressive transformer
  - Concatenate up to 256 BPE-encoded text tokens with the 32×32 = 1024 image tokens.
  - Train an autoregressive transformer to model the joint distribution over text and image tokens.
The overall procedure can be viewed as maximizing the evidence lower bound (ELB) (Kingma & Welling, 2013; Rezende et al., 2014) on the model's joint likelihood over images x, captions y, and image tokens z.
Factorizing this joint distribution yields the lower bound written out below.
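For reference, a sketch of the factorization and bound in the paper's notation, where q_φ is the dVAE encoder, p_θ the dVAE decoder, and p_ψ the joint prior over text and image tokens modeled by the transformer:

```latex
% Factorization of the joint distribution
p_{\theta,\psi}(x, y, z) = p_\theta(x \mid y, z)\, p_\psi(y, z)

% which yields the lower bound
\ln p_{\theta,\psi}(x, y) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big( \ln p_\theta(x \mid y, z) \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(y, z \mid x),\, p_\psi(y, z)\big) \Big)
```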
Preliminaries
An autoencoder (AE) is used for data reconstruction; a variational autoencoder (VAE) extends the AE with stochasticity and a KL-divergence loss so that diverse data samples can be generated.
Mathematical modeling
- Data samples x = {x_i}.
- The samples are assumed to be independent and identically distributed; the goal is to generate more new data that follows the same distribution.
- This requires estimating the distribution of the samples.
- Assuming the distribution has a parametric form, we need to estimate the parameters of that form.
- Maximum likelihood estimation: choose the parameters so that the parametric form fits as many samples as possible.
- Because the distribution of image data is hard to describe with an explicit expression, the parameters cannot be estimated directly.
- We instead write x_i ~ p(x; θ), where θ are the parameters to be estimated.
The textbook recipe is: likelihood function → take the logarithm → take partial derivatives → solve the equations (written out below).
Without an explicit expression for the distribution, this cannot be solved directly.
Instead, a latent variable with an assumed distribution is introduced, and the problem is solved approximately.
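A minimal sketch of the standard maximum-likelihood recipe referred to above:

```latex
% Maximum-likelihood estimation over i.i.d. samples x_1, \dots, x_N:
\hat{\theta} \;=\; \arg\max_\theta \prod_{i=1}^{N} p(x_i; \theta)
             \;=\; \arg\max_\theta \sum_{i=1}^{N} \ln p(x_i; \theta)

% In the tractable case one sets the gradient to zero and solves for \theta:
\nabla_\theta \sum_{i=1}^{N} \ln p(x_i; \theta) = 0
```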
Correspondence to the VAE:
- p(z) is the prior distribution over the latent space.
- p(x|z; φ) is the decoder.
- q(z|x; θ) denotes the encoder.
- The final generative result is p(x).
We want the distributions q and p to be as close as possible, which is measured with the KL divergence (to be minimized).
In the resulting bound, the first term is the reconstruction loss, and the second term is the constraint placed on the latent space in order to gain generative ability.
On the left-hand side, the first term (the log-likelihood) is to be maximized, and the second term is also to be maximized (the KL divergence, negated, is minimized).
Maximizing the left-hand side is therefore equivalent to maximizing the right-hand side, and the optimization objective becomes the ELBO written out below.
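A minimal sketch of the standard VAE identity these notes refer to:

```latex
% For any approximate posterior q(z \mid x):
\underbrace{\ln p(x) \;-\; D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z \mid x)\big)}_{\text{left-hand side}}
\;=\;
\underbrace{\mathbb{E}_{q(z \mid x)}\!\big[\ln p(x \mid z)\big]
  \;-\; D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{ELBO: reconstruction term minus latent-space constraint}}

% Since D_{KL}(q(z|x) || p(z|x)) >= 0, the ELBO is a lower bound on ln p(x),
% and maximizing the right-hand side maximizes the left-hand side.
```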
Stage 1: train a discrete variational autoencoder (dVAE)
- Trained on images only.
- The dVAE encoder and decoder are convolutional ResNets.
- Each 256×256 RGB image is compressed into 32×32 tokens.
- The ELB is difficult to optimize because q_φ is a discrete distribution: the codes are one-hot, and argmax is not differentiable.
- We use the gumbel-softmax relaxation, introducing a temperature t into the softmax so that the relaxed argmax becomes differentiable (see the sketch below).
- Real image pixels lie in a bounded range, whereas a typical VAE likelihood is over the reals, so p_θ is modeled with a logit-Laplace distribution, whose support is (0, 1).
- The ELB is optimized with Adam.
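A minimal sketch of the gumbel-softmax relaxation, using the grid and codebook sizes described above; the encoder output, codebook, and embedding width here are placeholders, not the actual dVAE (PyTorch also provides `torch.nn.functional.gumbel_softmax`):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0 / 16) -> torch.Tensor:
    """Relaxed one-hot sample over codebook entries.

    logits: (batch, 32*32, 8192) unnormalized encoder scores per grid position.
    tau:    relaxation temperature; as tau -> 0 the soft sample approaches a
            hard one-hot argmax, but stays differentiable for tau > 0.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

# Illustrative usage: the soft codes form a convex combination of codebook
# vectors, so gradients flow back into the encoder logits.
logits = torch.randn(2, 32 * 32, 8192, requires_grad=True)  # placeholder encoder output
codebook = torch.randn(8192, 512)                            # placeholder embedding table
soft_codes = gumbel_softmax_sample(logits, tau=1.0 / 16)     # (2, 1024, 8192)
decoder_input = soft_codes @ codebook                        # (2, 1024, 512)
decoder_input.sum().backward()                               # gradients reach `logits`
```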
Keys to training stability:
- The relaxation temperature t is set to 1/16.
- 1×1 convolutions are used at the end of the encoder and the beginning of the decoder.
- The outputs of the encoder and decoder resblocks are multiplied by a small constant to keep training stable at initialization.
- The KL weight β is increased to 6.6, which encourages better codebook usage and gives a smaller reconstruction error.
Stage 2: train an autoregressive transformer
Learn the prior distribution over text and image tokens with a 12-billion-parameter sparse, decoder-only transformer.
For each text-image pair:
- The text is BPE-encoded into at most 256 tokens, with a vocabulary size of 16,384.
- The image is encoded into 32×32 = 1024 tokens, with a vocabulary size of 8192.
- The image tokens come from the stage-1 dVAE encoder using argmax sampling, without adding gumbel noise.
- The two token sequences are concatenated into a single data stream and modeled autoregressively.
The transformer uses self-attention layers in which each image token can attend to all of the text tokens, with different attention masks for the text-to-text and image-to-image parts of the stream. Because the text is limited to 256 token positions, special padding tokens are learned for the positions left empty when a caption is short or missing, so those slots are always filled with something meaningful. Finally, the cross-entropy losses for the text and image tokens are normalized by the total number of each kind of token in a batch, and the image loss is weighted more heavily, since the focus is on image modeling. A sketch of the combined stream and weighted loss follows.
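A minimal sketch of the weighted loss over the concatenated text+image stream, using the lengths and vocabulary sizes above; the shared-vocabulary layout is an assumption for illustration, the 1/8 vs 7/8 weighting is the split reported in the paper, and the targets are assumed to already be shifted for next-token prediction:

```python
import torch
import torch.nn.functional as F

TEXT_LEN, IMG_LEN = 256, 32 * 32      # 256 text positions, 1024 image positions
TEXT_VOCAB, IMG_VOCAB = 16384, 8192   # separate vocabularies, laid out side by side

def sequence_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Weighted next-token cross-entropy over the combined stream.

    logits:  (batch, TEXT_LEN + IMG_LEN, TEXT_VOCAB + IMG_VOCAB)
    targets: (batch, TEXT_LEN + IMG_LEN) token ids, image ids offset by TEXT_VOCAB.
    """
    vocab = logits.size(-1)
    text_loss = F.cross_entropy(
        logits[:, :TEXT_LEN].reshape(-1, vocab),
        targets[:, :TEXT_LEN].reshape(-1),
    )
    image_loss = F.cross_entropy(
        logits[:, TEXT_LEN:].reshape(-1, vocab),
        targets[:, TEXT_LEN:].reshape(-1),
    )
    # Emphasize image modeling: 1/8 weight on text tokens, 7/8 on image tokens.
    return text_loss / 8 + 7 * image_loss / 8

# Illustrative usage with random tensors:
logits = torch.randn(2, TEXT_LEN + IMG_LEN, TEXT_VOCAB + IMG_VOCAB)
targets = torch.randint(0, TEXT_VOCAB + IMG_VOCAB, (2, TEXT_LEN + IMG_LEN))
loss = sequence_loss(logits, targets)
```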
Data collection
The Conceptual Captions dataset contains 3.3 million text-image pairs and was developed as an extension of MS-COCO.
To scale the model to 12 billion parameters, a dataset of similar scale to JFT-300M was built by collecting 250 million text-image pairs from the internet.
Mixed-precision training
To train such a large model effectively with parameters stored in 16-bit precision, the authors introduce a key technique: per-resblock gradient scaling. It addresses underflow: as the model gets deeper and wider, the exponents of the activation gradients passed from one resblock to the next can fall below the minimum representable in the 16-bit range and get rounded to zero. Avoiding this underflow is essential for stable training and convergence.
Previous approaches typically use a single gradient (loss) scale for the entire model, but the gradients of a generative model this large span too wide a range for one scale to cover. The authors instead use a separate gradient scale for each resblock, so that gradients in different parts of the network can be scaled as needed, which effectively avoids underflow and keeps training stable and convergent. A sketch of the idea follows.
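A minimal sketch of the idea in PyTorch; the wrapper, the fixed `scale` argument, and the parameter-gradient handling are illustrative, not the paper's exact procedure (which also adjusts each block's scale dynamically, like a per-block loss scale):

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by `scale`
    in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None


def resblock_with_grad_scale(block, x, scale):
    """Scale gradients up only while they flow backward through one resblock.

    In the backward pass, gradients entering the residual branch are multiplied
    by `scale` (avoiding fp16 underflow inside the block) and divided by `scale`
    again before leaving it, so the rest of the network sees unscaled values.
    Note: the parameter gradients inside `block` come out multiplied by `scale`
    and must be unscaled before the optimizer step, as with ordinary loss scaling.
    """
    h = ScaleGrad.apply(x, 1.0 / scale)  # backward: divide, i.e. unscale on exit
    h = block(h)                         # residual branch
    h = ScaleGrad.apply(h, scale)        # backward: multiply, i.e. scale on entry
    return x + h

# Illustrative usage with a tiny residual branch:
branch = torch.nn.Linear(8, 8)
x = torch.randn(4, 8, requires_grad=True)
y = resblock_with_grad_scale(branch, x, scale=2.0 ** 10)
y.sum().backward()
```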
Distributed optimization
A 12-billion-parameter model needs about 24 GB of memory at 16-bit precision, which exceeds the 16 GB of an NVIDIA V100 GPU. This is handled with parameter sharding (Rajbhandari et al., 2019).
- On the training cluster, the bandwidth between machines is lower than the bandwidth between GPUs within a machine.
- The training bottleneck is the all-reduce that averages gradients across machines.
- PowerSGD (Vogels et al., 2019) is used to compress the gradients.
- Each GPU independently computes low-rank factors of the gradients for its parameter shard.
- The error buffer is set to the residual between the gradient averaged over the 8 GPUs of a machine and the gradient decompressed from the low-rank factors.
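A minimal sketch of low-rank gradient compression with error feedback in the spirit of PowerSGD; the rank, the single power-iteration step, and the single-worker setting are illustrative, not the exact implementation:

```python
import torch

def compress_decompress(grad: torch.Tensor, error: torch.Tensor, rank: int = 4):
    """One PowerSGD-style step for a 2-D gradient `grad` of shape (m, n).

    Returns the low-rank approximation (the part that would be all-reduced
    across machines) and the new error-feedback buffer (the part the low-rank
    factors missed, kept locally and re-added at the next step).
    """
    g = grad + error                   # error feedback: re-add what was lost
    q = torch.randn(g.size(1), rank, device=g.device)
    p = g @ q                          # (m, rank): one step of power iteration
    p, _ = torch.linalg.qr(p)          # orthonormalize the left factor
    q = g.t() @ p                      # (n, rank): refit the right factor
    approx = p @ q.t()                 # rank-`rank` approximation of g
    return approx, g - approx

# Illustrative usage on a random "gradient" matrix:
grad = torch.randn(512, 256)
error = torch.zeros_like(grad)
approx, error = compress_decompress(grad, error)
```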
Sample generation
- A contrastive model is used to score images against the caption.
- The more candidate images are generated, the better the top-k results after reranking by score (see the sketch below).
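A minimal sketch of this reranking step; `score_fn` stands in for a pretrained contrastive text-image model (e.g. CLIP) and is not defined here:

```python
import torch

def rerank(caption: str, candidates: list, score_fn, k: int = 8) -> list:
    """Keep the top-k of the generated candidates by contrastive score.

    score_fn(caption, image) -> float is assumed to return a higher value for
    better text-image agreement.
    """
    scores = torch.tensor([score_fn(caption, img) for img in candidates])
    top = torch.topk(scores, k=min(k, len(candidates))).indices
    return [candidates[i] for i in top.tolist()]
```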
Experimental results
Quantitative results
Zero-shot evaluation of DALL-E against three prior works: AttnGAN, DM-GAN, and DF-GAN (Figure 3).
Human evaluation against DF-GAN (Figure 7).
IS and FID scores (computed with Inception Net-V3, which maps an input image to a feature vector; they measure the quality and diversity of the generated images):
- Figure 9a: on MS-COCO, the FID is within about 2 points of the best prior approach.
- Training on tokens from the dVAE encoder lets the model learn the low-frequency contours of images, but this compression is not conducive to producing high-frequency details.
- To evaluate this quantitatively, Gaussian filters with different blur radii are applied before computing the scores in Figure 9a.
- Figure 9b: on the CUB dataset the results are much worse, with a gap of roughly 40 points.
- The authors speculate that their zero-shot approach is less likely to compare favorably on specialized distributions such as CUB, and believe fine-tuning is a promising direction left to future work.
- My thoughts: CUB images are mostly close-up shots of birds with almost no background, so they demand fine detail; DALL-E is in principle weak at generating high-frequency details; web images, unless cropped, tend to contain a lot of background.
- Figure 9c: adding contrastive-model scoring and reranking lowers the FID.
Data overlap analysis
- A dedicated model is trained to compute image similarity.
- For each image, the most similar image is found.
- The results are inspected manually, and a threshold is set to decide which images should be removed (see the sketch below).
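A minimal sketch of this step, assuming hypothetical feature vectors produced by the similarity model; the features, threshold, and placeholder comparison set are illustrative:

```python
import torch
import torch.nn.functional as F

def flag_near_duplicates(query_feats: torch.Tensor,
                         train_feats: torch.Tensor,
                         threshold: float = 0.95) -> list:
    """Return indices of query images whose closest training image is too similar.

    Both inputs are (N, d) feature matrices from the similarity model; cosine
    similarity above `threshold` marks a candidate for removal (subject to the
    manual inspection described above).
    """
    q = F.normalize(query_feats, dim=1)
    t = F.normalize(train_feats, dim=1)
    best, _ = (q @ t.t()).max(dim=1)          # best match per query image
    return (best > threshold).nonzero(as_tuple=True)[0].tolist()

# Illustrative usage with random features:
flagged = flag_near_duplicates(torch.randn(100, 64), torch.randn(1000, 64))
```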
Qualitative findings
- The model can combine concepts at an abstract level (Figure 2a: accordion + tapir).
- It shows combinatorial generalization (Figure 2c: a baby hedgehog in a Christmas sweater walking a dog).
- It has zero-shot image-to-image translation ability, controlled with natural language (Figure 2d: drawing a sketch of the image); other translations include changing the color or the style.
- Some of these translations suggest the model has a rudimentary object-segmentation ability.
Conclusion
Text-to-image generation with an autoregressive transformer.
Scale can improve generation:
- strong zero-shot performance relative to previous domain-specific models;
- a single generative model integrates many capabilities.
Appendix
A. Details of the discrete VAE
   Architecture, training, the logit-Laplace distribution
B. Details of the transformer
   Architecture, training
C. Details of data collection
D. Guidelines for mixed-precision training
E. Details of distributed optimization
   Bandwidth analysis, implementation details
F. Details of the human evaluation experiment
G. Zero-shot image-to-image translations
   Sketches, flips, close-ups, turning red, adding sunglasses, postage-stamp style
My comments
A relatively early paper on text-to-image generation.