Coherence Estimation
The coherence estimation module in DICE evaluates whether each detected object-level change aligns with the user's editing instruction. It builds on the same MLLM architecture as the difference detector and takes as input the original and edited images, the localized bounding box of the change, and the associated modification type. The model outputs a binary decision (YES/NO) together with a textual rationale, indicating whether the modification is semantically consistent with the prompt.
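To make the interface described above concrete, the following minimal Python sketch shows one way the module's inputs and its YES/NO-plus-rationale output could be represented and parsed. It is illustrative only: the names CoherenceQuery, CoherenceResult, and parse_coherence_response are hypothetical, as are the field layout, the inclusion of the instruction text as an explicit field, and the assumption that the model's response begins with the token YES or NO followed by the rationale. None of these details are specified in the text.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CoherenceQuery:
    """Inputs for one detected change (hypothetical field layout)."""
    original_image: str                      # path to the original image
    edited_image: str                        # path to the edited image
    bbox: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    modification_type: str                   # label produced by the difference detector
    instruction: str                         # the user's editing instruction


@dataclass
class CoherenceResult:
    """Binary decision plus free-text rationale."""
    coherent: bool
    rationale: str


# Assumed response format: "YES" or "NO" first, then optional punctuation,
# then the rationale. The real output format may differ.
_RESPONSE_RE = re.compile(r"\s*(YES|NO)\b[\s.,:;\-]*(.*)", re.IGNORECASE | re.DOTALL)


def parse_coherence_response(text: str) -> CoherenceResult:
    """Split a raw model response into a boolean decision and a rationale."""
    match = _RESPONSE_RE.match(text)
    if match is None:
        raise ValueError("response does not start with YES or NO")
    decision, rationale = match.groups()
    return CoherenceResult(coherent=decision.upper() == "YES",
                           rationale=rationale.strip())
```

Under these assumptions, a response that starts with neither YES nor NO is rejected with an error instead of being silently mapped to a decision, so malformed outputs remain visible during evaluation.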
The model is trained on manually annotated samples from the EmuEdit dataset, which provide ground-truth labels for the detected differences and for the coherence of object-level changes. These annotations include both binary labels and natural-language explanations.





