Vanilla text-to-image diffusion models struggle to generate anatomically accurate human images, often producing unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding extra controls (human-centric priors such as pose or depth maps) at the image generation stage. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at inference. We realize this idea by proposing a human-centric alignment loss that strengthens human-related information from the textual prompts within the cross-attention maps. To preserve semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, informed by an in-depth analysis of the cross-attention layer. Extensive experiments show that our method substantially improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts.
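The abstract does not spell out the exact formulation, so the following PyTorch sketch only illustrates the general idea: align the pooled cross-attention of human-related prompt tokens with a human-region mask, gated by denoising step and attention-map scale. All names here (`human_centric_alignment_loss`, `constraint_weight`, the mask source, and the gating heuristics) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def human_centric_alignment_loss(attn_maps, human_token_ids, human_mask):
    """Encourage human-related tokens to attend to the human region.

    attn_maps:       (B, heads, H*W, T) softmaxed cross-attention maps.
    human_token_ids: indices of human-related prompt tokens (e.g., "woman").
    human_mask:      (B, H*W) binary human-region mask, e.g., derived from a
                     pose or depth prior available only during fine-tuning.
    """
    # Average over heads and pool the attention of the human-related tokens.
    human_attn = attn_maps.mean(dim=1)[:, :, human_token_ids].sum(dim=-1)  # (B, H*W)
    # Normalize per sample so attention and the target mask share a [0, 1] range.
    human_attn = human_attn / (human_attn.amax(dim=-1, keepdim=True) + 1e-8)
    return F.mse_loss(human_attn, human_mask.float())

def constraint_weight(t, resolution, t_max=1000, early_frac=0.3, target_res=16):
    """Assumed scale-aware / step-wise gating, not the paper's schedule.

    Heuristic: mid-resolution (e.g., 16x16) cross-attention maps tend to be
    the most semantically meaningful, and global structure is decided at
    high-noise (large-t) steps, so the constraint is applied there and
    relaxed elsewhere.
    """
    if resolution != target_res:
        return 0.0
    return 1.0 if t >= (1.0 - early_frac) * t_max else 0.1

# Toy usage with random tensors (B=2 images, 8 heads, 16x16 map, 77 tokens).
B, heads, res, T = 2, 8, 16, 77
attn = torch.softmax(torch.randn(B, heads, res * res, T), dim=-1)
mask = (torch.rand(B, res * res) > 0.5).float()
loss = constraint_weight(t=900, resolution=res) * human_centric_alignment_loss(
    attn, torch.tensor([2, 3]), mask
)
```

In an actual fine-tuning loop, a term like this would be added with a small weight to the standard diffusion denoising loss, so the model learns human structure without requiring the pose or depth prior at inference.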
@inproceedings{wang2024towards,
title={Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation},
author={Junyan Wang and Zhenhong Sun and Zhiyu Tan and Xuanbai Chen and Weihua Chen and Hao Li and Cheng Zhang and Yang Song},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024},
}
Name: Junyan Wang
Email: wjyau666@gmail.com