Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Authors: Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero.
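To make the two modifications in the abstract more concrete, below is a minimal PyTorch sketch of the general ideas, not the authors' released implementation. The function names `motion_enriched_latents` and `cross_frame_attention`, the wrap-around `torch.roll` warp, and the per-frame translation `delta` are illustrative assumptions; the paper's actual procedure also involves re-noising the warped latents, which is omitted here.

```python
# Illustrative sketch only (assumed API and hyperparameters, not the paper's code).
import torch

def motion_enriched_latents(x_T: torch.Tensor, num_frames: int, delta=(1, 1)):
    """Modification (i): derive per-frame initial latents from the first frame's latent.

    x_T: initial noise latent of shape (1, C, H, W) sampled for frame 1.
    Returns a tensor of shape (num_frames, C, H, W) whose k-th entry is the first
    latent translated by k * delta, inducing a coherent global (camera-like) motion.
    """
    frames = []
    for k in range(num_frames):
        dy, dx = k * delta[0], k * delta[1]
        # torch.roll gives a simple wrap-around translation; the paper combines a
        # warp with re-noising, which this sketch leaves out for brevity.
        frames.append(torch.roll(x_T, shifts=(dy, dx), dims=(-2, -1)))
    return torch.cat(frames, dim=0)

def cross_frame_attention(q, k, v):
    """Modification (ii): every frame attends to the keys/values of the FIRST frame.

    q, k, v: (num_frames, num_tokens, dim) projections from a self-attention layer
    of the denoising U-Net. Replacing per-frame keys/values with those of frame 0
    keeps the appearance and identity of the foreground object consistent.
    """
    k0 = k[:1].expand_as(k)  # broadcast first-frame keys to all frames
    v0 = v[:1].expand_as(v)  # broadcast first-frame values to all frames
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k0.transpose(-2, -1) * scale, dim=-1)
    return attn @ v0
```

In this sketch, the cross-frame attention is a drop-in replacement for frame-level self-attention inside a frozen text-to-image diffusion model, which is what allows the approach to work without any training or optimization.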

Paper link: http://arxiv.org/pdf/2303.13439v1

More computer science papers: http://cspaper.cn/
