MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

¹ReLER, CCAI, Zhejiang University   ²Huawei Technologies Ltd.
CVPR 2024


Multi-Instance Generation Controller (MIGC) is a plug-and-play controller that equips Stable Diffusion with precise position control while ensuring the correctness of attributes such as color, shape, material, texture, and style in Multi-Instance Generation (MIG). It can also control the number of instances and improve the interaction between them.
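
To make the task input concrete, the following is a minimal sketch of the kind of layout specification MIG operates on: normalized bounding boxes paired with per-instance text descriptions. The structure and field names are illustrative assumptions, not the interface of the released code.

    # Hypothetical layout specification for MIG; field names are illustrative.
    # Boxes are (x1, y1, x2, y2) in normalized [0, 1] image coordinates.
    layout = [
        {"bbox": (0.10, 0.55, 0.45, 0.95), "desc": "a red wooden chair"},
        {"bbox": (0.50, 0.50, 0.90, 0.95), "desc": "a blue metallic lamp"},
        {"bbox": (0.20, 0.05, 0.80, 0.45), "desc": "a green oil painting"},
    ]
    global_prompt = "a cozy living room"  # scene-level description

Each box constrains where an instance appears, while its description constrains the instance's attributes; MIGC is responsible for satisfying both simultaneously.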

Abstract

We present the Multi-Instance Generation (MIG) task: simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that each generated instance is placed accurately at its designated location and that all instances' attributes adhere to their corresponding descriptions. This broadens the scope of current research on single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially, we break down the MIG task into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Lastly, we aggregate all the shaded instances to provide the information necessary for accurately generating multiple instances in Stable Diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide the COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark as well as on various commonly used benchmarks; the results demonstrate our model's exceptional control over quantity, position, attribute, and interaction.

Method

[Figure: overview of the MIGC divide-conquer-combine pipeline]

Our MIGC builds upon a pre-trained T2I diffusion model. Stable Diffusion's UNet feeds the text description and image features into its Cross-Attention layers to obtain a residual feature, which is added back to the image features to determine the generated content; this resembles a shading process (i.e., coloring with parallel pencil lines or a block of color). In this view, Multi-Instance Generation can be considered multi-instance shading on the image features, and MIGC comprises three steps: (a) divide MIG into single-instance shading subtasks; (b) conquer each single-instance shading subtask with Enhancement Attention; (c) combine the shading results through Layout Attention and a Shading Aggregation Controller.
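
To ground the shading metaphor, below is a deliberately simplified PyTorch sketch of the divide-conquer-combine idea. It is a sketch under stated assumptions, not the paper's actual modules: Enhancement Attention is approximated here by masking each instance's cross-attention residual to its box, and the learned Shading Aggregation Controller is replaced by a plain masked sum.

    import torch

    def cross_attention(q, k, v):
        # Single-head scaled dot-product attention, kept minimal for clarity.
        scale = q.shape[-1] ** -0.5
        weights = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return weights @ v

    def migc_shading_sketch(img_feats, inst_texts, inst_masks, to_q, to_k, to_v):
        # img_feats:  (HW, C) image features at one UNet layer.
        # inst_texts: list of (T, C) per-instance text embeddings.
        # inst_masks: list of (HW, 1) binary masks rasterized from each box.
        # to_q/to_k/to_v: linear projections (assumed shared across instances).
        q = to_q(img_feats)
        shaded = []
        for text, mask in zip(inst_texts, inst_masks):
            # (a) Divide: shade each instance independently.
            residual = cross_attention(q, to_k(text), to_v(text))
            # (b) Conquer: confine the shading to the instance's box
            # (a crude stand-in for the paper's Enhancement Attention).
            shaded.append(residual * mask)
        # (c) Combine: a naive masked sum stands in for Layout Attention
        # and the learned Shading Aggregation Controller.
        return img_feats + torch.stack(shaded).sum(dim=0)

In MIGC itself, the combine step is learned rather than a fixed sum, which is what lets the controller arbitrate regions where instance boxes overlap.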

COCO-MIG Results

[Figures: results on the COCO-MIG benchmark]

COCO-Position Results

[Figures: results on the COCO-Position benchmark]

DrawBench Results

[Figures: results on DrawBench]

BibTeX


    @misc{zhou2024migc,
      title={MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis}, 
      author={Dewei Zhou and You Li and Fan Ma and Xiaoting Zhang and Yi Yang},
      year={2024},
      eprint={2402.05408},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }