Published at IJCV 2026

Hallucination Early Detection
in Diffusion Models

Federico Betti · Lorenzo Baraldi · Lorenzo Baraldi · Rita Cucchiara · Nicu Sebe
¹University of Trento · ²University of Modena and Reggio Emilia · ³University of Pisa
* Equal contribution

HEaD+ predicts, mid-generation, whether a diffusion model will hallucinate (omit requested objects). If a hallucination is detected, it restarts generation with a new seed, cutting generation time by up to 32% while producing 6-8% more complete images, i.e., images containing all requested objects. It is model-agnostic and requires no retraining: it works with both UNet-based models (Stable Diffusion, TokenCompose) and DiT-based models (PixArt-α).

Key Results

+8%
Increase in complete generations for 4-object prompts with SD 1.4
32%
Generation time saved when targeting a complete image
45K
Images in InsideGen dataset with cross-attention maps & PFIs
0
Retraining needed: works out-of-the-box with any diffusion model

The Problem

Diffusion models often fail when generating images with multiple objects. Our analysis shows that with just 4 objects in the prompt, Stable Diffusion 1.4 produces a complete image (showing all objects) only 27% of the time.

The good news? Different random seeds can lead to vastly different results. When testing 11 seeds per prompt, at least one of them produces a complete image for 80% of prompts. The challenge is finding that good seed without wasting time on failed generations.
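To see why retrying seeds is promising but not sufficient, here is a back-of-the-envelope check in Python (the independence assumption is ours, for illustration only):

p_single = 0.27                        # single-seed success rate (4 objects, SD 1.4)
n_seeds = 11
p_any = 1 - (1 - p_single) ** n_seeds  # chance at least one seed succeeds,
print(f"{p_any:.1%}")                  # assuming independent seeds -> 96.9%
# The observed rate is only 80%, so prompts differ in difficulty:
# some prompts fail under nearly every seed.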

HEaD+ motivation and pipeline overview

Left: Success rate drops dramatically as object count increases. Right: HEaD+ detects hallucinations mid-generation and restarts with a better seed.

How HEaD+ Works

HEaD+ operates at a critical timestep during the diffusion process, combining three signals to predict whether each requested object will appear in the final image:

  • Predicted Final Image (PFI): a forecast of the final image at an intermediate step
  • Cross-Attention Maps: show where the model is "looking" for each object
  • Textual Embeddings: CLIP features of the requested objects

A lightweight Transformer Decoder processes these inputs to predict, for each requested object, whether it will appear. If any object is predicted missing, generation restarts with a new seed, catching failures early and saving up to 32% of generation time.
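As a sketch, the early-detection loop could be wired up as below. Here pipe stands in for any diffusion pipeline, and partial_denoise, predict_final_image, encode_text, and continue_denoise are hypothetical helpers used for illustration, not a real library API:

import torch

def generate_complete(pipe, head_plus, prompt, objects,
                      check_step=16, num_steps=50, max_retries=10):
    """Restart with a new seed whenever HEaD+ predicts a missing object."""
    for attempt in range(max_retries):
        generator = torch.Generator().manual_seed(attempt)
        # Run only the first `check_step` denoising steps (hypothetical helper).
        latents, attn_maps = pipe.partial_denoise(prompt, steps=check_step,
                                                  generator=generator)
        pfi = pipe.predict_final_image(latents, step=check_step)  # PFI
        text_emb = pipe.encode_text(objects)                      # CLIP object features
        presence = head_plus.predict(pfi, attn_maps, text_emb)    # one flag per object
        if all(presence):
            # Only a promising seed pays for the remaining denoising steps.
            return pipe.continue_denoise(latents, steps=num_steps - check_step)
    return None  # every tried seed was predicted to hallucinate

Seeding with the attempt index keeps the sketch deterministic; in practice any fresh seed works.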

HEaD+ architecture

Left: HEaD+ pipeline integrating with the diffusion process. Right: Hallucination Prediction network architecture combining PFI, attention maps, and text features.

Predicted Final Image (PFI)

The Predicted Final Image is a key innovation of HEaD+. By projecting the latent representation at an intermediate timestep forward to an estimate of the final step, we can preview what the final image will look like before generation completes. As early as timestep 16 (out of 50), object presence is clearly visible.
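The paper's exact projection operator is described above only informally, but the standard DDPM/DDIM identity for estimating the clean sample captures the idea. A minimal sketch, assuming the PFI is the VAE-decoded version of this estimate:

import torch

def predict_x0(latent_t, eps_pred, alpha_bar_t):
    """Project the noisy latent x_t to an estimate of the clean latent x_0.

    Standard DDPM/DDIM identity:
        x_0 ≈ (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t)
    where eps_hat is the denoiser's noise prediction at timestep t.
    Decoding this estimate with the VAE gives the Predicted Final Image
    (our assumption, based on the description above).
    """
    return (latent_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)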

PFI at different timesteps

Predicted Final Images at different diffusion timesteps. By step 16, the final composition is already clear.

Qualitative Comparison

HEaD+ dramatically improves multi-object generation across different base models. When the base model hallucinates (misses objects), HEaD+ detects this and finds a seed that produces a complete image.

InsideGen Dataset

We release InsideGen, a dataset of 45,000 generated images with cross-attention maps and Predicted Final Images at multiple timesteps. Each image carries per-object hallucination labels; prompts contain between 2 and 7 objects.

Timesteps captured: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 40
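For convenience, the captured timesteps as a Python constant (the on-disk layout of the release is not described here, so this only mirrors the list above):

# Timesteps at which InsideGen stores cross-attention maps and PFIs.
INSIDEGEN_TIMESTEPS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 40]
assert 16 in INSIDEGEN_TIMESTEPS  # includes step 16, where object presence is already clear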

Stable Diffusion 1.4

Cross-attention maps & PFIs for SD 1.4 generations

Download

Stable Diffusion 2.1

Cross-attention maps & PFIs for SD 2.1 generations

Download

Citation

If you find this work useful, please cite:

@article{betti2026head,
  title={Hallucination Early Detection in Diffusion Models},
  author={Betti, Federico and Baraldi, Lorenzo and Baraldi, Lorenzo 
          and Cucchiara, Rita and Sebe, Nicu},
  journal={International Journal of Computer Vision},
  volume={134},
  pages={35},
  year={2026},
  publisher={Springer}
}