Fine-grained Controllable Image Generation Using LVMs

April 2, 2024

Fine-grained control is necessary to generate high-end visuals that satisfy the unique requirements of beauty brands, because every detail matters. Existing tools, like Stable Diffusion and Midjourney, offer only very limited control over image generation because users can steer them only through text prompts. As the saying goes, “an image is worth thousands of words”: text alone cannot pin down every visual detail.

To achieve fine-grained control, we believe users must be able to supply controlling information in multiple formats, including both text and images. The challenge is how to design good formats and how to fuse the multiple pieces of controlling information into a single image. To this end, we have designed a novel framework called ControlNOLA.

Currently, the framework defines four types of controlling information: one text input and three image inputs. As shown in the following figure, one text controlling module and three image controlling modules handle the text input and the three image inputs, respectively. Each module is responsible for controlling certain parts of the generated image, and some modules are optional. We omit here the details of the controlling information and of how it is fused into the generator. It is worth mentioning, however, that the formats of the controlling information were designed by listening to the requirements of creators and producers in the commercial creative industry, so that the image generation workflow using our framework is friendly to industry practitioners.

Figure: The framework of image generation using ControlNOLA. One text controlling module and three image controlling modules are integrated into a pre-trained text-to-image generator such as SDXL. Some modules are optional, but using all the modules together gives the strongest and finest-grained control.
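
As a rough illustration of the module layout described above, the following sketch shows one way the four control inputs and their optional modules might be represented. The class, field, and function names (`ControlInputs`, `image_control_1`, `active_modules`, and so on) are hypothetical placeholders and are not part of ControlNOLA. Because the post does not say which modules are optional, the sketch simply assumes that any input may be omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControlInputs:
    """Hypothetical container for one text control and three image controls.

    Field names are placeholders; the real control formats are not described
    in this post. Image controls are represented here as file paths.
    """
    text: Optional[str] = None
    image_control_1: Optional[str] = None
    image_control_2: Optional[str] = None
    image_control_3: Optional[str] = None


def active_modules(inputs: ControlInputs) -> List[str]:
    """Return the names of the control modules that would be enabled.

    Assumption: a module is enabled exactly when its input is provided.
    """
    field_names = ("text", "image_control_1", "image_control_2", "image_control_3")
    return [name for name in field_names if getattr(inputs, name) is not None]


# Example: a text prompt plus one of the three image controls.
example = ControlInputs(text="a lipstick on a marble table", image_control_2="layout.png")
print(active_modules(example))
```

Under this assumption, supplying all four inputs enables every module, which corresponds to the strongest control setting mentioned in the caption.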


ControlNOLA offers three main advantages:

  1. Fine-Grained Control. It offers users fine-grained control over the content of generated product visuals.
  2. Efficient Production. It lets users generate high-quality visuals quickly, streamlining production workflows.
  3. Professional-Grade Results. It consistently produces professional-grade images, so that the end results meet or exceed industry standards.