CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Authors: Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu

The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with contrastive objective like CLIP, image-to-text generative objective like PaLI, or text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other as contrasting provides global alignment capacity and generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning.
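To make the idea of unifying three objectives concrete, here is a minimal sketch of how a contrastive loss and two generative losses could be summed over one batch of image-text pairs. It is not the paper's implementation: the function names, the loss weights `w_con`, `w_i2t`, `w_t2i`, the temperature value, and the assumption that images are represented as discrete tokens for the text-to-image loss are all illustrative choices that the abstract does not specify.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumes embeddings are already L2-normalised and that row k of
    img_emb matches row k of txt_emb. The temperature is an assumed value.
    """
    n = len(img_emb)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt_emb]
           for i in img_emb]
    # Image-to-text direction: each image should pick its own caption.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def token_nll(logits, targets):
    """Mean cross-entropy of target token ids under per-position logits."""
    return -sum(log_softmax(l)[t] for l, t in zip(logits, targets)) / len(targets)


def combined_loss(img_emb, txt_emb,
                  i2t_logits, text_tokens,
                  t2i_logits, image_tokens,
                  w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_i2t * token_nll(i2t_logits, text_tokens)     # captioning loss
            + w_t2i * token_nll(t2i_logits, image_tokens))   # image-token loss


if __name__ == "__main__":
    img = [[1.0, 0.0], [0.0, 1.0]]
    txt = [[1.0, 0.0], [0.0, 1.0]]
    # Uniform logits over a 2-word text vocabulary and a 4-code image vocabulary.
    print(combined_loss(img, txt, [[0.0, 0.0]], [0], [[0.0] * 4], [2]))
```

In this toy setup the contrastive term rewards matching pairs scoring higher than mismatched ones across the batch, while the two cross-entropy terms score each output token individually, which is one way to read the abstract's contrast between global alignment and fine-grained understanding.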

Paper link: http://arxiv.org/pdf/2303.13455v1

More computer science papers: http://cspaper.cn/
