GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Abstract

Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it produces detail-enhanced results of the multi-view images from stage 1 iteratively, ensuring the 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly perform reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results.

Method

Overview of the GaussianIP framework. We combine 3D Gaussian Splatting (3DGS) with a human-centric diffusion prior to realize high-fidelity 3D human avatar generation. (a) We initialize 3D human Gaussians by densely sample from a SMPL-X mesh. Afterward, (b) a human-centric diffusion model is combined with a pose-guide ControlNet to produce AHDS guidance. The AHDS guidance consists of an HDS guidance, which is proposed to achieve better identity-preserving generation, and an Adaptive Human-specific Timestep Scheduling strategy, which accelerates the HDS training. Furthermore, we propose (c) a View-Consistent Refinement Mechanism to further enhance the delicate texture of faces and garments. We guide the denoising of key views $\boldsymbol{x}_0^P$ with attention features from main views $\boldsymbol{x}_0^M$ through Mutual Attention. Next, we align the denoising of an intermediate view $\boldsymbol{x}_0^I$ with that of its neighbor key views via distance-guided attention fusion. Finally, the refined multi-view images are leveraged to optimize the current 3DGS.

Illustration of the optimized weight PDF for sampling timesteps and the corresponding timestep vs. training step (t-i) curve. a) Phase 1, 3 occupy the majority of the training steps, while Phase 2 occupies only a small portion, allowing a quick transition to the detailed texture learning in Phase 3. b) We sample the final timestep between the lower bound and $t_{\text{DG}}$ for each phase. Note that for the geometry phase ($i<500$), we sample between 500 and the maximum timestep to ensure a smooth start.

Qualitative Comparisons

Qualitative comparison results with SOTA text-guided 3D human generation models. Please zoom in for better observation. Note that the baselines cannot handle image prompts, so we compare their text-to-3D results instead. Due to space limitations, please refer to the supplementary materials for the video comparison results.

BibTeX

@misc{tang2025gaussianipidentitypreservingrealistic3d, title={GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior}, author={Zichen Tang and Yuan Yao and Miaomiao Cui and Liefeng Bo and Hongyu Yang}, year={2025}, eprint={2503.11143}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2503.11143}, }

GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Given a text prompt and a reference image, GaussianIP can generate high-fidelity 3D human avatars in around 40 minutes.

Abstract

Method

Qualitative Comparisons

BibTeX