FlexGen: Flexible Multi-View Generation from Text and Image Inputs

1HKUST(GZ) 2HKUST 3Quwan

* Equal contribution

arXiv preprint

FlexGen is a flexible framework designed to generate high-quality, consistent multi-view images conditioned on a single-view image, a text prompt, or both. Our method allows users to edit unseen regions and modify material properties through user-defined text.


Abstract

In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, a text prompt, or both. FlexGen tackles the challenge of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We leverage the strong reasoning capabilities of GPT-4V to generate these annotations: by analyzing four orthogonal views of an object arranged as a tiled multi-view image, GPT-4V produces text annotations that capture 3D-aware information, including spatial relationships. By integrating this control signal with the proposed adaptive dual-control module, our model generates multi-view images that correspond to the specified text. FlexGen supports multiple forms of control: users can modify the text prompt to generate plausible unseen parts that correspond to the prompt, and can also influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality.


Method Overview

Overview of the framework. FlexGen is a flexible framework for generating controllable and consistent multi-view images, conditioned on a single-view image, a text prompt, or both. The system incorporates a 3D-aware annotation method based on GPT-4V and an adaptive dual-control module that integrates both a reference input image and text prompts for precise joint control. A condition switcher adds flexibility, enabling the model to generate multi-view images from image input, text input, or a combination of both modalities; a sketch of this idea follows below.
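To make the condition switcher concrete, below is a minimal PyTorch sketch of how such a module could work: each modality is projected into a shared conditioning space, and a learned null token stands in for whichever modality is switched off. The module names, dimensions, and null-token mechanism are illustrative assumptions on our part, not the paper's released architecture.

import torch
import torch.nn as nn

class ConditionSwitcher(nn.Module):
    """Merge text and image conditions, allowing either one to be switched off."""

    def __init__(self, text_dim=768, image_dim=1024, cond_dim=1024):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.image_proj = nn.Linear(image_dim, cond_dim)
        # Learned "null" tokens stand in for a dropped modality.
        self.null_text = nn.Parameter(torch.zeros(1, 1, cond_dim))
        self.null_image = nn.Parameter(torch.zeros(1, 1, cond_dim))

    def forward(self, text_emb, image_emb, use_text=True, use_image=True):
        # text_emb: (B, T, text_dim); image_emb: (B, S, image_dim)
        b = image_emb.shape[0]
        t = self.text_proj(text_emb) if use_text else self.null_text.expand(b, -1, -1)
        i = self.image_proj(image_emb) if use_image else self.null_image.expand(b, -1, -1)
        # Downstream cross-attention layers attend to whichever conditions are active.
        return torch.cat([t, i], dim=1)

During training, use_text and use_image can be toggled at random so the denoiser learns image-only, text-only, and joint conditioning; at inference, the same flags select the generation mode.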

3D-aware caption generation pipeline

A 3D object is rendered into four orthogonal views (front, left, back, right), which are tiled in a 2×2 grid. GPT-4V then generates both global and local descriptions: the global description captures the object's overall attributes and the 3D spatial relationships between its components, while the local descriptions detail specific aspects such as color, posture, and texture, thereby enriching the dataset with rich semantic annotations. A sketch of this step follows below.
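The annotation step can be approximated with a short script: the sketch below tiles four renders into a 2×2 grid and queries a GPT-4V-class model through the OpenAI chat completions API. The prompt wording, tiling order, and model name are our assumptions, not the authors' exact setup.

import base64
from io import BytesIO

from PIL import Image
from openai import OpenAI

def tile_views(front, left, back, right):
    """Arrange four equally sized orthogonal renders in a 2x2 grid."""
    w, h = front.size
    grid = Image.new("RGB", (2 * w, 2 * h))
    for img, xy in zip((front, left, back, right),
                       ((0, 0), (w, 0), (0, h), (w, h))):
        grid.paste(img, xy)
    return grid

def caption_views(grid_image):
    """Ask a GPT-4V-class model for global and local descriptions."""
    buf = BytesIO()
    grid_image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("The image tiles four orthogonal views of one 3D object "
                          "(top-left: front, top-right: left, bottom-left: back, "
                          "bottom-right: right). Provide (1) a global description "
                          "of the object and the 3D spatial relationships between "
                          "its parts, and (2) local details such as color, posture, "
                          "and texture.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content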

Text to multi-view


Effect of the caption


More cases


BibTeX

@article{xu2024flexgen,
  author  = {Xinli Xu and Wenhang Ge and Jiantao Lin and Jiawei Feng and Lie Xu and HanFeng Zhao and Shunsi Zhang and Ying-Cong Chen},
  title   = {FlexGen: Flexible Multi-View Generation from Text and Image Inputs},
  journal = {arXiv preprint},
  year    = {2024},
}