Mind the Heads: Topological Representation Alignment for Multimodal LLMs

HeRA: Head-Wise Representation Alignment

Davide Caffagni†¹, Alberto Compagnoni†¹,², Federico Melis†¹, Sara Sarto¹,
Pier Luigi Dovesi³, Mark Granroth-Wilding³, Marcella Cornia¹, Lorenzo Baraldi¹
¹University of Modena and Reggio Emilia  ·  ²University of Pisa  ·  ³AMD Silo AI
†Equal contribution
Figure 1. Standard representation alignment imposes strict vision–language feature matching (left), while HeRA (center) matches cross-modal local neighbors, leading to superior VQA results (right).

Abstract

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors.

TL;DR. Instead of aligning a fixed LLM layer to vision features, HeRA aligns the topology (local neighborhoods) of a handful of worst-aligned attention heads with a frozen vision teacher via a contrastive loss. This yields consistent gains on vision-centric VQA and improved hallucination scores across 9 LLMs from 3B to 14B.

Key Contributions

Granularity

Head-Wise Alignment

We move beyond fixed-layer alignment and intervene at the level of individual attention heads, an atomic unit that mitigates conflicts with the language modeling objective.

Objective

Topology, not Features

A multi-target contrastive loss acts as a differentiable proxy for the MKNN metric, aligning the local neighborhood structure rather than enforcing rigid feature matching.

Selection

Pick the Worst Heads

Counterintuitively, applying HeRA to the least aligned heads (ranked by MKNN before training) yields the largest gains — strengthening weak components while preserving aligned ones.

Results

Robust Gains, Less Hallucination

Up to +3.6 pts on Vision-Centric benchmarks across 9 LLMs (3B–14B), with stable or improved General/Knowledge/OCR performance and reduced visual hallucinations.

Method

HeRA augments the standard language-modeling objective with a contrastive head-level alignment loss that pushes the multimodal representation of an image–caption pair \((I_i, x_i)\), computed at a small set of selected attention heads, toward its \(k\)-nearest neighbors in the latent space of a frozen vision teacher.

Figure 2. Overview of HeRA. Alongside the standard language modeling objective \(\mathcal{L}_{\text{LM}}\), HeRA employs a contrastive loss \(\mathcal{L}_{\text{HeRA}}\) to pull representations from selected LLM attention heads closer to their \(k\)-nearest neighbors (Top-\(k\)), computed in the latent space of a frozen teacher vision encoder.

1. Topological Alignment via MKNN

Grounded in the Platonic Representation Hypothesis, we measure cross-modal alignment with the Mutual K-Nearest Neighbor (MKNN) metric: the average overlap between the \(k\)-nearest neighbor sets induced by the LLM and by the teacher vision encoder. MKNN captures local neighborhood agreement rather than global geometric matching.
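
The metric described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `knn_indices` and `mutual_knn`, the use of cosine similarity, and the normalization of the overlap by \(k\) are assumptions made for the example.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors under cosine similarity (self excluded)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn(feats_a: np.ndarray, feats_b: np.ndarray, k: int) -> float:
    """Average fraction of k-nearest neighbors shared by two feature spaces.

    feats_a and feats_b hold features of the same N samples (row i in both
    arrays describes sample i); their feature dimensions may differ.
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))
```

Because only neighbor identities are compared, the two spaces never need a shared coordinate system: a score of 1 means every sample has the same \(k\) neighbors in both spaces, regardless of how the features themselves are scaled or rotated.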

2. Contrastive Proxy

Since MKNN is non-differentiable, we use a multi-target InfoNCE loss as a differentiable proxy: for each batch element \(i\), the \(k\) visual nearest neighbors of the image \(I_i\) (in teacher space) become positives for the MLLM representation of \((I_i, x_i)\).
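
A rough sketch of such a multi-target contrastive loss is given below. Several details are assumptions that the text above does not specify: similarities are taken between the student (MLLM head) representations of the batch elements, cosine similarity with a temperature `tau` is used, and the log-probabilities of the \(k\) positives are averaged. The name `multi_target_info_nce` is hypothetical, and NumPy is used only to show the forward computation (a training setup would use an autodiff framework).

```python
import numpy as np


def multi_target_info_nce(student: np.ndarray, teacher: np.ndarray,
                          k: int, tau: float = 0.1) -> float:
    """Contrastive loss whose positives are each sample's k nearest neighbors in teacher space.

    student: (N, d_s) representations from the model being aligned.
    teacher: (N, d_t) features of the same N samples from the frozen teacher.
    """
    n = student.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < batch size")

    # Teacher side: the k nearest neighbors (cosine similarity, self excluded).
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    t_sim = t @ t.T
    np.fill_diagonal(t_sim, -np.inf)
    positives = np.argsort(-t_sim, axis=1)[:, :k]  # shape (N, k)

    # Student side: temperature-scaled similarities between batch elements.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    logits = (s @ s.T) / tau
    np.fill_diagonal(logits, -np.inf)  # drop self-pairs from the softmax

    # Row-wise log-softmax, computed in a numerically stable way.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average negative log-probability over the k positives of every anchor.
    pos_log_prob = np.take_along_axis(log_prob, positives, axis=1)
    return float(-pos_log_prob.mean())
```

Under this formulation the loss is bounded below by \(\log k\), reached when each anchor spreads its probability mass evenly over exactly its \(k\) teacher-space neighbors, which is the situation in which the student and teacher neighborhoods coincide.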

3. Head-Level Intervention

We decompose multi-head attention along the rows of the output projection \(W_O\), isolating the contribution of each head before it is summed into the residual stream. This lets HeRA apply the contrastive objective to individual heads, the smallest meaningful unit of an LLM's reasoning machinery.
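
The decomposition relies on a standard identity: multiplying the concatenated head outputs by \(W_O\) equals summing, over heads, each head's output times its own block of rows of \(W_O\). The sketch below illustrates this under assumed conventions (row-vector activations and a \(W_O\) of shape `(num_heads * head_dim, d_model)`); the function name `per_head_contributions` is hypothetical. Frameworks that store the projection weight transposed would split along columns instead.

```python
import numpy as np


def per_head_contributions(head_outputs: np.ndarray, w_o: np.ndarray) -> np.ndarray:
    """Split the attention output into one residual-stream contribution per head.

    head_outputs: (num_heads, seq_len, head_dim), each head's attention-weighted values.
    w_o:          (num_heads * head_dim, d_model), the output projection.
    Returns:      (num_heads, seq_len, d_model); summing over axis 0 recovers
                  the usual multi-head attention output.
    """
    num_heads, _, head_dim = head_outputs.shape
    # Rows h*head_dim : (h+1)*head_dim of W_O form the block that multiplies head h.
    w_o_blocks = w_o.reshape(num_heads, head_dim, -1)
    return np.einsum("hsd,hdm->hsm", head_outputs, w_o_blocks)
```

Each slice `contributions[h]` lives in the model dimension, so a per-head objective can be attached to it before the heads are summed into the residual stream.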

4. Selecting the Worst-5 Heads

We rank all heads by their MKNN score against the vision teacher, measured before training, and pick the five least aligned heads. Empirically, aligning these heads boosts their alignment without degrading already-aligned ones, whereas aligning the top heads (which are already well aligned) or random heads gives little benefit.
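
The selection step amounts to a global ranking over all (layer, head) pairs. The snippet below is a small sketch of that ranking, assuming the per-head MKNN scores have already been collected into a `(num_layers, num_heads)` array; the function name `select_least_aligned_heads` and the array layout are illustrative choices, not part of the original method description.

```python
import numpy as np


def select_least_aligned_heads(mknn_scores: np.ndarray,
                               num_selected: int = 5) -> list[tuple[int, int]]:
    """Return (layer, head) indices of the heads with the lowest MKNN scores.

    mknn_scores: (num_layers, num_heads) array of per-head alignment scores.
    Heads are ranked globally across layers, lowest score first.
    """
    # Sort the flattened scores in ascending order; a stable sort keeps ties deterministic.
    flat_order = np.argsort(mknn_scores, axis=None, kind="stable")[:num_selected]
    layers, heads = np.unravel_index(flat_order, mknn_scores.shape)
    return [(int(layer), int(head)) for layer, head in zip(layers, heads)]
```

Swapping the ascending sort for a descending one would give the Top-5 variant that the text reports as less beneficial.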

Figure 3. The animation demonstrates the core mechanism of HeRA. Multimodal inputs from a batch are processed, and the visual data is passed through a frozen teacher vision encoder. This defines the local neighborhood structure in the teacher's latent space, serving as the target for HeRA's contrastive loss to align specific LLM attention heads.

Results

Across 9 LLMs · Vision-Centric Average

HeRA delivers consistent improvements on the demanding Vision category of the Cambrian benchmark suite, scaling across architectural families (Vicuna, LLama3, Qwen2.5, Qwen3) and model sizes (3B–14B), while preserving or improving General / Knowledge / OCR performance.

Table 1. VQA results of HeRA on the LLaVA training recipe, across nine LLMs. Δ reports the change in the Vision average.
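| Model | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| Qwen2.5-3B | 73.5 | 46.7 | 42.2 | 50.5 | |
| + HeRA | 74.5 | 47.5 | 43.8 | 52.9 | +2.4 |
| Qwen3-4B | 75.6 | 49.6 | 43.8 | 56.3 | |
| + HeRA | 76.0 | 50.1 | 44.5 | 58.5 | +2.2 |
| Vicuna-7B | 72.2 | 44.3 | 45.7 | 49.7 | |
| + HeRA | 72.1 | 44.5 | 45.7 | 52.0 | +2.3 |
| LLama3-8B | 73.3 | 45.0 | 43.0 | 53.8 | |
| + HeRA | 74.6 | 46.3 | 44.7 | 55.1 | +1.3 |
| Qwen2.5-7B | 76.2 | 50.2 | 47.9 | 56.7 | |
| + HeRA | 76.5 | 50.5 | 48.6 | 57.4 | +0.7 |
| Qwen3-8B | 74.7 | 49.7 | 43.5 | 55.9 | |
| + HeRA | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |
| Vicuna-13B | 73.4 | 45.5 | 47.7 | 52.7 | |
| + HeRA | 73.6 | 45.7 | 47.6 | 53.9 | +1.2 |
| Qwen2.5-14B | 75.6 | 50.7 | 44.8 | 54.9 | |
| + HeRA | 77.4 | 52.8 | 49.3 | 58.3 | +3.4 |
| Qwen3-14B | 77.4 | 52.8 | 46.1 | 58.2 | |
| + HeRA | 77.7 | 52.6 | 47.8 | 58.9 | +0.7 |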

Comparison with Representation Alignment Methods

On Qwen3-8B, HeRA outperforms recent representation alignment strategies — including feature reconstruction (ROSS, JARVIS), middle-layer alignment (VIRAL), and global CKA-based alignment (CMAR) — with the highest gains on the demanding Vision category.

Table 2. VQA comparison of different representation alignment strategies for MLLMs.
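| Alignment | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| None (LLaVA baseline) | 74.7 | 49.7 | 43.5 | 55.9 | |
| ROSS | 74.6 | 49.4 | 44.0 | 56.2 | +0.3 |
| VIRAL | 73.8 | 49.3 | 43.6 | 54.2 | −1.7 |
| JARVIS | 76.8 | 49.9 | 46.2 | 58.7 | +2.8 |
| CMAR | 76.4 | 51.0 | 46.0 | 56.9 | +1.0 |
| HeRA (Ours) | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |
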
Figure 4. Comparison of MKNN scores of the Worst-5 and Top-5 heads after the second training stage performed with HeRA and our competitors (i.e. LLaVA, VIRAL, JARVIS, ROSS, CMAR).

Fewer Visual Hallucinations

Although mitigating hallucinations is not an explicit objective, HeRA consistently reduces hallucination rates on CHAIR-MSCOCO, AMBER (generative & discriminative), and HallusionBench, while preserving cognition/coverage scores. Aligning topology curbs over-reliance on linguistic priors.

Choice of the Vision Teacher

Self-supervised teachers work best: DINOv2 / DINOv3 deliver strong, consistent gains, while using SigLIP2 (the primary vision encoder) as teacher is mostly ineffective. Base-size DINOv2-B is already sufficient — scaling the teacher to 1B parameters gives no further benefit.

Figure 5. VQA results of HeRA with different teacher vision encoders.

Qualitative Examples

HeRA visibly improves visual grounding on fine-grained perception, counting, spatial reasoning, and dense scenes — reducing the model's tendency to fall back on linguistic priors.

Figure 6. Qualitative comparison of LLaVA, ROSS, and HeRA across Cambrian categories using Qwen3-8B and SigLIP2.
Figure 7. Failure cases of HeRA on VQA tasks.

Citation

@article{caffagni2026hera,
    title   = {{Mind the Heads: Topological Representation Alignment for Multimodal LLMs}},
    author  = {Caffagni, Davide and Compagnoni, Alberto and Melis, Federico and
                Sarto, Sara and Dovesi, Pier Luigi and Granroth-Wilding, Mark and
                Cornia, Marcella and Baraldi, Lorenzo},
    journal = {arXiv preprint},
    year    = {2026}
}

Acknowledgments

This work has been supported by the EU Horizon project ELLIOT (No. 101214398), by the EuroHPC JU project MINERVA (GA No. 101182737), and by the PNRR project ITSERR (CUP B53C22001770006) funded by the EU — NextGenerationEU. We also acknowledge EuroHPC JU for awarding the project EHPC-AIF-2025SC04-225 access to LUMI at CSC, Finland.