What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

1University of Modena and Reggio Emilia, Italy, 2University of Pisa, Italy, 3University of Trento, Italy, 4IIT-CNR, Italy

We present DICE, a novel framework to detect and evaluate instruction-guided image edits by identifying differences between original and edited images and assessing their coherence with the editing prompt.

DICE Teaser

Abstract

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most existing metrics fall short in terms of alignment with human judgment and explainability.

To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision.

Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models, showing strong correlation with human judgment.

Method Overview

DICE Method Overview

Our DICE framework consists of two main components:

  • Difference Detector: Identifies localized differences between the original and edited images
  • Coherence Estimator: Assesses the relevance of detected changes with respect to the editing prompt

Both components are built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a combination of self-supervision, distillation from inpainting networks, and full supervision.

Difference Detection

The difference detection module in DICE uses a Multimodal Large Language Model (MLLM) to identify and localize semantic changes between original and edited images at the object level. The task is framed as structured text generation, where the model outputs (command, object, bounding box) triplets describing localized modifications categorized as ADD, REMOVE, or EDIT. These predictions are independent of the user prompt and serve as the basis for coherence evaluation.
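To make the output format concrete, a minimal Python sketch for parsing such triplets is shown below. The line-based syntax ("COMMAND | object | [x1, y1, x2, y2]") and the helper names are illustrative assumptions; the text above only specifies that the model emits (command, object, bounding box) triplets.

```python
import re
from dataclasses import dataclass

@dataclass
class Difference:
    command: str   # one of "ADD", "REMOVE", "EDIT"
    obj: str       # object category or free-form description
    bbox: tuple    # (x1, y1, x2, y2) box coordinates

def parse_differences(mllm_output: str) -> list[Difference]:
    """Parse (command, object, bounding box) triplets from a structured
    MLLM output. Assumes one triplet per line, e.g.:
        ADD | red balloon | [120, 45, 310, 280]
    """
    triplets = []
    for line in mllm_output.strip().splitlines():
        match = re.match(r"(ADD|REMOVE|EDIT)\s*\|\s*(.+?)\s*\|\s*\[([\d.,\s]+)\]", line)
        if match:
            cmd, obj, coords = match.groups()
            bbox = tuple(float(c) for c in coords.split(","))
            triplets.append(Difference(cmd, obj, bbox))
    return triplets
```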

The model is trained in two stages:

  • Stage 1: Trained on visually similar image pairs from the LVIS dataset, selected using DINOv2 embeddings (see the sketch after this list). The model learns to detect object-level differences by identifying which instances appear in one image but not in the other.
  • Stage 2: Fine-tuned on synthetically edited image pairs created via LaMa and Kandinsky inpainting. This stage introduces controlled object-level operations—additions, deletions, and replacements—allowing the model to generalize to diverse edit types.
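A minimal sketch of how Stage 1 pairs could be mined with DINOv2 embeddings is given below; the checkpoint name (facebook/dinov2-base), the similarity threshold, and the brute-force pairwise search are illustrative assumptions rather than the exact selection procedure used in training.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Illustrative checkpoint; the text above does not specify which DINOv2 variant is used.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_embedding(path: str) -> torch.Tensor:
    """Return an L2-normalized global (CLS) embedding for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    cls = model(**inputs).last_hidden_state[:, 0]  # CLS token
    return torch.nn.functional.normalize(cls, dim=-1)

def similar_pairs(paths: list[str], threshold: float = 0.8):
    """Yield (path_a, path_b, similarity) for visually similar image pairs."""
    embs = torch.cat([dino_embedding(p) for p in paths])
    sims = embs @ embs.T
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if sims[i, j] >= threshold:
                yield paths[i], paths[j], sims[i, j].item()
```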

Coherence Estimation

The coherence estimation module in DICE evaluates whether each detected object-level change aligns with the user's editing instruction. It builds upon the same MLLM architecture as the difference detector and takes as input the original and edited images, the localized bounding box of the change, and the associated modification type. The model outputs a binary decision (YES/NO) along with a textual rationale, determining if the modification is semantically consistent with the prompt.
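A minimal sketch of how such a query could be assembled and the reply parsed is shown below; the exact prompt wording and helper names are assumptions, and only the inputs (instruction, change type, object, bounding box) and the YES/NO-plus-rationale output follow the description above.

```python
def build_coherence_query(prompt: str, command: str, obj: str, bbox) -> str:
    """Compose a textual query for the coherence estimator (illustrative wording)."""
    return (
        f"Editing instruction: {prompt}\n"
        f"Detected change: {command} {obj} at bounding box {list(bbox)}\n"
        "Is this change coherent with the instruction? Answer YES or NO, "
        "then briefly explain why."
    )

def parse_coherence_answer(answer: str) -> tuple[bool, str]:
    """Split the model's reply into a binary decision and its textual rationale."""
    head, _, rationale = answer.strip().partition("\n")
    return head.strip().upper().startswith("YES"), rationale.strip()
```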

The model is trained on manually annotated samples from the EmuEdit dataset, which provide ground-truth labels of detected differences and coherence for object-level changes. These annotations include both binary labels and natural language explanations.
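For illustration, a single annotated sample could be represented as the record below; the field names are hypothetical, but the contents (a localized change, a binary coherence label, and a natural language explanation) mirror the description above.

```python
# Hypothetical structure of one annotated training sample;
# field names are illustrative, not the released annotation schema.
sample = {
    "instruction": "replace the dog with a cat",
    "difference": {"command": "EDIT", "object": "dog", "bbox": [88, 130, 240, 310]},
    "coherent": True,  # binary ground-truth label
    "explanation": "The dog inside the box was replaced by a cat, matching the request.",
}
```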

Quantitative Results

Table 1: Difference Detection Performance
Performance comparison of different MLLMs in the difference detection stage of our pipeline, evaluated under both class-agnostic and class-aware settings. Results are reported in terms of AP across the different training configurations.
Table 3: Coherence Estimation Results
Benchmark comparison of model rankings generated by DICE and those derived from the user study. The first two columns contrast average human ratings with scores obtained using DICE. The final column compares the percentage of unchanged images in the user study (cases with maximal background preservation and minimal prompt adherence) with the corresponding percentage identified by DICE.

BibTeX

@inproceedings{baraldi2025dice,
  title     = {What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models},
  author    = {Baraldi, Lorenzo and Bucciarelli, Davide and Betti, Federico and Cornia, Marcella and Baraldi, Lorenzo and Sebe, Niculae and Cucchiara, Rita},
  booktitle = {arXiv preprint},
  year      = {2025}
}