HeRA — Head-Wise Representation Alignment
Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors.
We move beyond fixed-layer alignment and intervene on individual attention heads, a finer-grained unit at which alignment conflicts less with the language modeling objective.
A multi-target contrastive loss acts as a differentiable proxy for the MKNN metric, aligning the local neighborhood structure rather than enforcing rigid feature matching.
Counterintuitively, applying HeRA to the least aligned heads (ranked by MKNN before training) yields the largest gains: it strengthens weak components while preserving aligned ones.
Up to +3.6 pts on Vision-Centric benchmarks across 9 LLMs (3B–14B), with stable or improved General/Knowledge/OCR performance and reduced visual hallucinations.
HeRA augments the standard language-modeling objective with a contrastive head-level alignment loss that pushes the multimodal representation of an image–caption pair \((I_i, x_i)\), computed at a small set of selected attention heads, toward its \(k\)-nearest neighbors in the latent space of a frozen vision teacher.
Grounded in the Platonic Representation Hypothesis, we measure cross-modal alignment with the Mutual K-Nearest Neighbor (MKNN) metric: for each sample, take its \(k\)-nearest neighbor set in the LLM space and in the teacher vision encoder space, and average the size of their intersection over all samples. MKNN therefore captures local neighborhood agreement rather than global geometric matching.
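The metric can be illustrated with a minimal NumPy sketch. Cosine similarity, the exclusion of each sample from its own neighbor set, and the division by \(k\) (so the score lies in [0, 1]) are assumptions of this sketch; `knn_indices` and `mutual_knn` are hypothetical helper names, not part of any released code.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k nearest neighbors of every row.

    Neighbors are ranked by cosine similarity (an assumption of this sketch),
    and a sample is never counted as its own neighbor.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn(feats_a: np.ndarray, feats_b: np.ndarray, k: int) -> float:
    """Average fraction of k-nearest neighbors shared by two feature spaces.

    Row i of feats_a and row i of feats_b must describe the same sample.
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))
```

Calling `mutual_knn(head_feats, teacher_feats, k)` on two arrays of shape `(n_samples, dim_a)` and `(n_samples, dim_b)` returns 1.0 when both spaces induce identical neighbor sets and approaches `k / (n_samples - 1)` for unrelated spaces. The two feature dimensions do not need to match, since only the neighbor indices are compared.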
Since MKNN is non-differentiable, we use a multi-target InfoNCE loss as a differentiable proxy: for each batch element \(i\), the \(k\) visual nearest neighbors of the image \(I_i\) (in teacher space) become positives for the MLLM representation of \((I_i, x_i)\).
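One possible form of such a multi-target objective is sketched below in NumPy (forward pass only). The exact loss is not given here, so several choices are assumptions: the loss averages the log-softmax over the positives, similarities are cosine similarities scaled by a temperature, the image's own teacher embedding counts as one of its \(k\) neighbors, and the student features are assumed to be already projected to the teacher dimension. The name `multi_target_info_nce` is hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def multi_target_info_nce(student: np.ndarray, teacher: np.ndarray,
                          k: int, temperature: float = 0.1) -> float:
    """Multi-positive InfoNCE between student and teacher features.

    student: (batch, dim) MLLM features, assumed already projected to `dim`.
    teacher: (batch, dim) frozen vision-teacher features of the same images.
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)

    # Positives for sample i: the k teacher-space neighbors of image i.
    # Self is not masked out, so image i is among its own positives
    # (an assumption of this sketch).
    teacher_sim = t @ t.T
    pos_idx = np.argsort(-teacher_sim, axis=1)[:, :k]
    pos_mask = np.zeros(teacher_sim.shape, dtype=bool)
    np.put_along_axis(pos_mask, pos_idx, True, axis=1)

    # Student-to-teacher similarities act as logits over the whole batch.
    logits = (s @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-probability assigned to the k positives.
    return float(-(log_prob * pos_mask).sum(axis=1).mean() / k)
```

The loss is small when the student places each sample close to the teacher embeddings of its visual neighbors and far from the rest of the batch, which is the neighborhood agreement that MKNN rewards. In training, the same computation would be written in an autodiff framework so that gradients reach the selected heads.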
We decompose multi-head attention along the rows of the output projection \(W_O\), isolating the contribution of each head before it is summed into the residual stream. This lets HeRA apply the contrastive objective to individual heads, the smallest meaningful unit of an LLM's reasoning machinery.
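The decomposition rests on a simple identity: projecting the concatenated head outputs with \(W_O\) equals summing, over heads, each head's output multiplied by its own block of rows of \(W_O\). The NumPy sketch below checks this under assumed conventions (row-vector activations, \(W_O\) of shape `(n_heads * d_head, d_model)`, no bias); `per_head_contributions` is a hypothetical name.

```python
import numpy as np


def per_head_contributions(head_outputs: np.ndarray, w_o: np.ndarray) -> np.ndarray:
    """Split the attention output into one residual-stream term per head.

    head_outputs: (n_tokens, n_heads, d_head) attention-weighted values.
    w_o:          (n_heads * d_head, d_model) output projection.
    Returns:      (n_tokens, n_heads, d_model); summing over the head axis
                  recovers the usual attention output.
    """
    n_tokens, n_heads, d_head = head_outputs.shape
    d_model = w_o.shape[1]
    # Head h owns rows [h * d_head, (h + 1) * d_head) of W_O.
    w_o_per_head = w_o.reshape(n_heads, d_head, d_model)
    return np.einsum("thd,hdm->thm", head_outputs, w_o_per_head)


rng = np.random.default_rng(0)
heads = rng.normal(size=(5, 4, 8))   # 5 tokens, 4 heads, d_head = 8
w_o = rng.normal(size=(4 * 8, 32))   # d_model = 32

standard = heads.reshape(5, 4 * 8) @ w_o          # concatenate, then project
decomposed = per_head_contributions(heads, w_o)   # project each head separately
assert np.allclose(decomposed.sum(axis=1), standard)
```

Each slice `decomposed[:, h, :]` lives in the model dimension, so a per-head representation can be read out and fed to an alignment loss without touching the other heads.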
We rank all heads by their MKNN score against the vision teacher, measured before training, and pick the five least aligned heads. Empirically, aligning these heads boosts their alignment without degrading the heads that are already aligned, whereas aligning the top-ranked heads (already well aligned with the teacher) or random heads gives little benefit.
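The selection step amounts to an ascending sort over per-head scores. A minimal sketch, assuming the scores are stored in a hypothetical `(n_layers, n_heads)` array and that ties are broken by position:

```python
import numpy as np


def select_least_aligned_heads(mknn_scores: np.ndarray, n_select: int = 5):
    """Return (layer, head) pairs of the heads with the lowest MKNN scores.

    mknn_scores: (n_layers, n_heads) array of scores measured before training.
    """
    # axis=None sorts the flattened array; "stable" keeps ties in index order.
    flat_order = np.argsort(mknn_scores, axis=None, kind="stable")[:n_select]
    layers, heads = np.unravel_index(flat_order, mknn_scores.shape)
    return [(int(layer), int(head)) for layer, head in zip(layers, heads)]


# Toy scores for a 2-layer, 3-head model (illustrative values, not measured).
toy_scores = np.array([[0.30, 0.05, 0.20],
                       [0.10, 0.40, 0.02]])
selected = select_least_aligned_heads(toy_scores, n_select=2)
# selected == [(1, 2), (0, 1)]
```

Sorting in descending order instead would give the top-ranked variant that the comparison above reports as less effective.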
HeRA delivers consistent improvements on the demanding Vision category of the Cambrian benchmark suite, scaling across architectural families (Vicuna, Llama3, Qwen2.5, Qwen3) and model sizes (3B–14B). General, Knowledge, and OCR scores improve in most cases and never drop by more than 0.2 points.
| Model | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| Qwen2.5-3B | 73.5 | 46.7 | 42.2 | 50.5 | |
| + HeRA | 74.5 | 47.5 | 43.8 | 52.9 | +2.4 |
| Qwen3-4B | 75.6 | 49.6 | 43.8 | 56.3 | |
| + HeRA | 76.0 | 50.1 | 44.5 | 58.5 | +2.2 |
| Vicuna-7B | 72.2 | 44.3 | 45.7 | 49.7 | |
| + HeRA | 72.1 | 44.5 | 45.7 | 52.0 | +2.3 |
| Llama3-8B | 73.3 | 45.0 | 43.0 | 53.8 | |
| + HeRA | 74.6 | 46.3 | 44.7 | 55.1 | +1.3 |
| Qwen2.5-7B | 76.2 | 50.2 | 47.9 | 56.7 | |
| + HeRA | 76.5 | 50.5 | 48.6 | 57.4 | +0.7 |
| Qwen3-8B | 74.7 | 49.7 | 43.5 | 55.9 | |
| + HeRA | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |
| Vicuna-13B | 73.4 | 45.5 | 47.7 | 52.7 | |
| + HeRA | 73.6 | 45.7 | 47.6 | 53.9 | +1.2 |
| Qwen2.5-14B | 75.6 | 50.7 | 44.8 | 54.9 | |
| + HeRA | 77.4 | 52.8 | 49.3 | 58.3 | +3.4 |
| Qwen3-14B | 77.4 | 52.8 | 46.1 | 58.2 | |
| + HeRA | 77.7 | 52.6 | 47.8 | 58.9 | +0.7 |
On Qwen3-8B, HeRA outperforms recent representation alignment strategies, including feature reconstruction (ROSS, JARVIS), middle-layer alignment (VIRAL), and global CKA-based alignment (CMAR). Its gain on the demanding Vision category is the largest: +3.6 points, against +2.8 for the next-best method (JARVIS).
| Alignment | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| — (LLaVA baseline) | 74.7 | 49.7 | 43.5 | 55.9 | |
| ROSS | 74.6 | 49.4 | 44.0 | 56.2 | +0.3 |
| VIRAL | 73.8 | 49.3 | 43.6 | 54.2 | −1.7 |
| JARVIS | 76.8 | 49.9 | 46.2 | 58.7 | +2.8 |
| CMAR | 76.4 | 51.0 | 46.0 | 56.9 | +1.0 |
| HeRA (Ours) | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |
Although mitigating hallucinations is not an explicit objective, HeRA consistently reduces hallucination rates on CHAIR-MSCOCO, AMBER (generative and discriminative), and HallusionBench, while preserving cognition and coverage scores. This suggests that aligning topological structure curbs over-reliance on linguistic priors.
Self-supervised teachers work best: DINOv2 and DINOv3 deliver strong, consistent gains, while using SigLIP2 (the primary vision encoder) as the teacher is mostly ineffective. The base-size DINOv2-B is already sufficient; scaling the teacher to 1B parameters gives no further benefit.
HeRA visibly improves visual grounding on fine-grained perception, counting, spatial reasoning, and dense scenes — reducing the model's tendency to fall back on linguistic priors.
@article{caffagni2026hera,
  title   = {{Mind the Heads: Topological Representation Alignment for Multimodal LLMs}},
  author  = {Caffagni, Davide and Compagnoni, Alberto and Melis, Federico and
             Sarto, Sara and Dovesi, Pier Luigi and Granroth-Wilding, Mark and
             Cornia, Marcella and Baraldi, Lorenzo},
  journal = {arXiv preprint},
  year    = {2026}
}
This work has been supported by the EU Horizon project ELLIOT (No. 101214398), by the EuroHPC JU project MINERVA (GA No. 101182737), and by the PNRR project ITSERR (CUP B53C22001770006) funded by the EU — NextGenerationEU. We also acknowledge EuroHPC JU for awarding the project EHPC-AIF-2025SC04-225 access to LUMI at CSC, Finland.