
Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

1University of Modena and Reggio Emilia, Italy · 2University of Milan, Italy

We propose ScanDiff, a unified architecture that integrates diffusion models with Vision Transformers to generate diverse and realistic gaze scanpaths. Unlike existing approaches, ScanDiff explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, enabling the generation of diverse yet plausible gaze trajectories.

[Figure: ScanDiff teaser]

Abstract

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics.

While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths.

Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives.

Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.

Method Overview

[Figure: ScanDiff method overview]

ScanDiff is based on a unified architecture combining Diffusion Models with Transformers. It represents the first diffusion-based approach for scanpath prediction on natural images. Textual conditioning allows ScanDiff to work both in free-viewing and task-driven scenarios, and a dedicated length prediction module is introduced to handle variable-length scanpaths.

How does the model work?

  1. Scanpath embedding: The scanpath is embedded into the initial uncorrupted latent variable z0.
  2. Forward diffusion: Gaussian noise is added to the embedded sequence z0 over T timesteps.
  3. Visual encoding: The stimulus I is encoded with a Transformer-based backbone (DINOv2).
  4. Task encoding: A textual encoder (CLIP) processes the viewing task c.
  5. Multimodal fusion: Visual and textual features are projected into a joint multimodal embedding space.
  6. Denoising: A Transformer encoder refines the noised scanpath embedding zt, conditioned on the multimodal features.
  7. Reconstruction: A three-layer MLP γθ reconstructs the scanpath, and a dedicated length prediction module estimates its length.
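Step 2 above follows the standard Gaussian forward process used by diffusion models. Below is a minimal NumPy sketch of that corruption step; the embedding shape, noise schedule, and number of timesteps are illustrative assumptions, not ScanDiff's actual hyperparameters:

```python
import numpy as np

# Toy stand-in for step 2 (forward diffusion). Dimensions are hypothetical:
# 16 fixations, 3 features each (e.g., x, y, duration).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 3))        # embedded scanpath z_0

T = 1000                                 # number of diffusion timesteps (assumed)
betas = np.linspace(1e-4, 2e-2, T)       # linear noise schedule (assumed)
alpha_bars = np.cumprod(1.0 - betas)     # cumulative products, \bar{alpha}_t

def q_sample(z0, t, noise):
    """Draw z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise

noise = rng.standard_normal(z0.shape)
zt = q_sample(z0, t=500, noise=noise)    # noised embedding handed to the denoiser
```

As t approaches T, z_t becomes almost pure Gaussian noise; the denoising Transformer of step 6 learns to invert this process, conditioned on the DINOv2 image features and CLIP task features.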

Qualitative Results

Scanpath Variability Analysis

Human visual exploration is inherently variable: individuals perceive the same stimulus differently depending on factors such as attention, context, and cognitive processes. Capturing this variability is essential for models that aim to reflect the full range of human gaze behavior. However, existing scanpath prediction models tend to align closely with the statistical mean of human gaze behavior. While this may improve performance on traditional evaluation metrics, it fails to reflect the natural variability of human visual attention.

Commonly used metrics such as MultiMatch (MM), ScanMatch (SM), and Sequence Score (SS) reward predictions that closely match an aggregated ground truth, thus favoring models that generate a single representative scanpath. Indeed, when generated scanpaths all reflect such an average behavior, their mutual similarity can exceed the average similarity among the ground-truth scanpaths themselves.

We propose the Diversity-aware Sequence Score (DSS), a new metric that extends standard sequence similarity measures with a term that penalizes excessive similarity among the generated scanpaths when the human scanpaths do not exhibit such similarity. Given a set of generated scanpaths s_g and corresponding human scanpaths s_h for a specific visual stimulus, DSS is computed as:

[Equation: DSS definition]

This is a first attempt to quantitatively assess a model's ability to generate diverse, yet human-like, gaze trajectories. ScanDiff achieves the best overall performance across all settings and datasets, highlighting its effectiveness in predicting accurate eye movement trajectories that align well with human scanpath variability. Goal-oriented scanpaths tend to be more deterministic, particularly in the target-present setting, and are generally shorter than those in free-viewing scenarios. Nevertheless, our model effectively captures even the subtler variability present in human gaze behavior.
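To make the intuition behind a diversity-aware score concrete, here is a toy sketch: an accuracy term minus a penalty that fires when generated scanpaths are more mutually similar than the human ones. The similarity function, the penalty form, and all names here are illustrative assumptions and are not the paper's DSS definition:

```python
import numpy as np

def seq_sim(a, b):
    """Toy similarity between two fixation sequences (higher = more alike).
    A simple stand-in for a sequence-similarity metric, NOT the Sequence
    Score used in the paper."""
    n = min(len(a), len(b))
    dist = np.linalg.norm(np.asarray(a[:n], float) - np.asarray(b[:n], float),
                          axis=1).mean()
    return 1.0 / (1.0 + dist)

def mean_pairwise_sim(paths):
    """Average similarity over all unordered pairs of scanpaths."""
    sims = [seq_sim(p, q) for i, p in enumerate(paths) for q in paths[i + 1:]]
    return float(np.mean(sims)) if sims else 1.0

def diversity_aware_score(generated, human):
    """Accuracy term (each generated path vs. its best-matching human path)
    minus a penalty applied when generated scanpaths are more mutually
    similar than the human ones."""
    accuracy = float(np.mean([max(seq_sim(g, h) for h in human)
                              for g in generated]))
    penalty = max(0.0, mean_pairwise_sim(generated) - mean_pairwise_sim(human))
    return accuracy - penalty
```

Under this toy score, a model that emits near-identical "average" scanpaths is penalized relative to one whose outputs spread out as much as the human scanpaths do, which is the behavior DSS is designed to reward.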

BibTeX

@inproceedings{cartella2025modeling,
  title     = {Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction},
  author    = {Cartella, Giuseppe and Cuculo, Vittorio and D'Amelio, Alessandro and Cornia, Marcella and Boccignone, Giuseppe and Cucchiara, Rita},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}