HeRA: Head-Wise Representation Alignment for Multimodal LLMs

Abstract

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors.

TL;DR. Instead of aligning a fixed LLM layer to vision features, HeRA uses a contrastive loss to align the topology (local neighborhoods) of the five worst-aligned attention heads with a frozen vision teacher. This yields consistent gains on vision-centric VQA and fewer hallucinations across 9 LLMs from 3B to 14B.

Key Contributions

Granularity

Head-Wise Alignment

We move beyond fixed-layer alignment and intervene at the level of individual attention heads, an atomic unit that mitigates conflicts with the language modeling objective.

Objective

Topology, not Features

A multi-target contrastive loss acts as a differentiable proxy for the MKNN metric, aligning the local neighborhood structure rather than enforcing rigid feature matching.

Selection

Pick the Worst Heads

Counterintuitively, applying HeRA to the least aligned heads (ranked by MKNN before training) yields the largest gains — strengthening weak components while preserving aligned ones.

Results

Robust Gains, Less Hallucination

Up to +3.6 pts on Vision-Centric benchmarks across 9 LLMs (3B–14B), with stable or improved General/Knowledge/OCR performance and reduced visual hallucinations.

Method

HeRA augments the standard language-modeling objective with a contrastive head-level alignment loss that pushes the multimodal representation of an image–caption pair \((I_i, x_i)\), computed at a small set of selected attention heads, toward its \(k\)-nearest neighbors in the latent space of a frozen vision teacher.

Figure 2. Overview of HeRA. Alongside the standard language modeling objective \(\mathcal{L}_{\text{LM}}\), HeRA employs a contrastive loss \(\mathcal{L}_{\text{HeRA}}\) to pull representations from selected LLM attention heads closer to their \(k\)-nearest neighbors (Top-\(k\)), computed in the latent space of a frozen teacher vision encoder.
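As a sketch, the combined training objective can be written as below. The weighting coefficient \(\lambda\) is an assumption introduced here for illustration; the description above does not specify how the two terms are balanced.

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{LM}} \;+\; \lambda \, \mathcal{L}_{\text{HeRA}}
\]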

Topological Alignment via MKNN

Grounded in the Platonic Representation Hypothesis, we measure cross-modal alignment with the Mutual K-Nearest Neighbor (MKNN) metric: the average overlap between the \(k\)-nearest-neighbor sets that the LLM and the teacher vision encoder induce over the same samples. MKNN captures local neighborhood agreement rather than global geometric matching.
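To make the metric concrete, here is a minimal NumPy sketch of an MKNN-style score. It assumes cosine similarity for the neighbor search and normalizes the overlap by \(k\); the function names (`knn_indices`, `mknn_score`) and the toy data are hypothetical and not taken from the HeRA code.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each row (cosine similarity, self excluded)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mknn_score(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbors shared by two representation spaces."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Toy data (assumption): 32 samples embedded in a 16-d "teacher" space.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(32, 16))

# A rotated copy has different coordinates but the same neighborhoods.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
same_topology = teacher @ rotation

# An independent random space shares neighbors only by chance.
unrelated = rng.normal(size=(32, 8))

score_same = mknn_score(teacher, same_topology, k=5)
score_unrelated = mknn_score(teacher, unrelated, k=5)
print(score_same, score_unrelated)
```

The rotated copy illustrates why the metric is topological: the features themselves differ, yet the neighbor sets, and hence the score, are preserved.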

Contrastive Proxy

Since MKNN is non-differentiable, we use a multi-target InfoNCE loss as a differentiable proxy: for each batch element \(i\), the \(k\) visual nearest neighbors of the image \(I_i\) (in teacher space) become positives for the MLLM representation of \((I_i, x_i)\).
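The following NumPy sketch shows one way such a multi-target InfoNCE loss could be computed for a single head. Several details are assumptions made for illustration: similarities are taken between student representations within the batch, cosine similarity is used, and the temperature value and the names `topk_neighbors` and `multi_target_infonce` are hypothetical. It only evaluates the loss value; actual training would use an autodiff framework.

```python
import numpy as np

def topk_neighbors(feats, k):
    """Top-k cosine neighbors of each row, excluding the row itself."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def multi_target_infonce(student, teacher, k=2, temperature=0.1):
    """Average negative log-probability the student assigns to teacher-space neighbors."""
    positives = topk_neighbors(teacher, k)            # (n, k) positive indices per sample
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    logits = (s @ s.T) / temperature
    np.fill_diagonal(logits, -np.inf)                 # drop self-similarity from the softmax
    # Numerically stable row-wise log-softmax.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    pos_log_prob = np.take_along_axis(log_prob, positives, axis=1)
    return float(-pos_log_prob.mean())

# Toy batch (assumption): 16 samples, 8-d features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
student_aligned = teacher @ rotation      # same neighborhoods as the teacher
student_random = rng.normal(size=(16, 8)) # unrelated neighborhoods

loss_aligned = multi_target_infonce(student_aligned, teacher)
loss_random = multi_target_infonce(student_random, teacher)
print(loss_aligned, loss_random)
```

A student whose neighborhoods already match the teacher's receives a lower loss than an unrelated one, which is the behavior expected of a proxy for MKNN.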

Head-Level Intervention

We decompose multi-head attention along the rows of the output projection \(W_O\), isolating the contribution of each head before it is summed into the residual stream. This lets HeRA apply the contrastive objective to individual heads, the smallest meaningful unit of an LLM's reasoning machinery.
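The decomposition can be checked numerically. The sketch below assumes the row-vector convention, where the concatenated head outputs are multiplied on the right by \(W_O\) of shape (heads × head_dim, d_model), so each head owns a contiguous block of rows. All sizes are toy values chosen for illustration; frameworks that store the projection weight transposed would slice columns instead.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, d_model, seq_len = 4, 8, 32, 5

# Per-head attention outputs for each token: (seq_len, num_heads, head_dim).
head_outputs = rng.normal(size=(seq_len, num_heads, head_dim))
W_O = rng.normal(size=(num_heads * head_dim, d_model))

# Standard path: concatenate heads, then apply the output projection.
concat = head_outputs.reshape(seq_len, num_heads * head_dim)
attn_out = concat @ W_O

# Head-wise path: give each head its own block of rows of W_O.
W_O_blocks = W_O.reshape(num_heads, head_dim, d_model)
per_head = np.einsum("shd,hdm->shm", head_outputs, W_O_blocks)  # (seq_len, num_heads, d_model)

# Summing the per-head contributions recovers the usual attention output.
assert np.allclose(per_head.sum(axis=1), attn_out)

# per_head[:, h] is the contribution of head h, which a head-level loss could target.
print(per_head.shape)
```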

Selecting the Worst-5 Heads

We rank all heads by their MKNN score against the vision teacher, measured before training, and pick the five least aligned. Empirically, aligning these heads raises their alignment without degrading heads that are already well aligned, whereas aligning the top-ranked heads (which already agree with the teacher) or random heads gives little benefit.
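A minimal sketch of the selection step, assuming the per-head MKNN scores have already been collected into a (layers × heads) array. The helper name `select_worst_heads` and the random scores are hypothetical placeholders, not measured values.

```python
import numpy as np

def select_worst_heads(mknn_scores, num_selected=5):
    """Return (layer, head) pairs of the lowest-scoring heads, worst first."""
    scores = np.asarray(mknn_scores)
    flat_order = np.argsort(scores, axis=None, kind="stable")[:num_selected]
    layers, heads = np.unravel_index(flat_order, scores.shape)
    return [(int(l), int(h)) for l, h in zip(layers, heads)]

# Placeholder scores (assumption): 4 layers with 8 heads each.
rng = np.random.default_rng(0)
scores = rng.uniform(0.05, 0.4, size=(4, 8))
scores[2, 3] = 0.01  # plant one clearly misaligned head

worst = select_worst_heads(scores)
print(worst)
```

Only the heads returned here would receive the alignment loss; all other heads are trained with the language-modeling objective alone.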

Figure 3. The animation demonstrates the core mechanism of HeRA. Multimodal inputs from a batch are processed, and the visual data is passed through a frozen teacher vision encoder. This defines the local neighborhood structure in the teacher's latent space, serving as the target for HeRA's contrastive loss to align specific LLM attention heads.

Results

Across 9 LLMs · Vision-Centric Average

HeRA delivers consistent improvements on the demanding Vision category of the Cambrian benchmark suite, scaling across architectural families (Vicuna, Llama3, Qwen2.5, Qwen3) and model sizes (3B–14B), while preserving or improving General / Knowledge / OCR performance.

**Table 1.** VQA results of HeRA on the LLaVA training recipe, across nine LLMs. Δ reports the change in the Vision average.

| Model | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| Qwen2.5-3B | 73.5 | 46.7 | 42.2 | 50.5 | |
| + HeRA | 74.5 | 47.5 | 43.8 | 52.9 | +2.4 |
| Qwen3-4B | 75.6 | 49.6 | 43.8 | 56.3 | |
| + HeRA | 76.0 | 50.1 | 44.5 | 58.5 | +2.2 |
| Vicuna-7B | 72.2 | 44.3 | 45.7 | 49.7 | |
| + HeRA | 72.1 | 44.5 | 45.7 | 52.0 | +2.3 |
| Llama3-8B | 73.3 | 45.0 | 43.0 | 53.8 | |
| + HeRA | 74.6 | 46.3 | 44.7 | 55.1 | +1.3 |
| Qwen2.5-7B | 76.2 | 50.2 | 47.9 | 56.7 | |
| + HeRA | 76.5 | 50.5 | 48.6 | 57.4 | +0.7 |
| Qwen3-8B | 74.7 | 49.7 | 43.5 | 55.9 | |
| + HeRA | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |
| Vicuna-13B | 73.4 | 45.5 | 47.7 | 52.7 | |
| + HeRA | 73.6 | 45.7 | 47.6 | 53.9 | +1.2 |
| Qwen2.5-14B | 75.6 | 50.7 | 44.8 | 54.9 | |
| + HeRA | 77.4 | 52.8 | 49.3 | 58.3 | +3.4 |
| Qwen3-14B | 77.4 | 52.8 | 46.1 | 58.2 | |
| + HeRA | 77.7 | 52.6 | 47.8 | 58.9 | +0.7 |

Comparison with Representation Alignment Methods

On Qwen3-8B, HeRA outperforms recent representation alignment strategies — including feature reconstruction (ROSS, JARVIS), middle-layer alignment (VIRAL), and global CKA-based alignment (CMAR) — with the highest gains on the demanding Vision category.

**Table 2.** VQA comparison of different representation alignment strategies for MLLMs.

| Alignment | General | Knowledge | OCR | Vision | Δ Vision |
|---|---|---|---|---|---|
| — (LLaVA baseline) | 74.7 | 49.7 | 43.5 | 55.9 | |
| ROSS | 74.6 | 49.4 | 44.0 | 56.2 | +0.3 |
| VIRAL | 73.8 | 49.3 | 43.6 | 54.2 | −1.7 |
| JARVIS | 76.8 | 49.9 | 46.2 | 58.7 | +2.8 |
| CMAR | 76.4 | 51.0 | 46.0 | 56.9 | +1.0 |
| HeRA (Ours) | 76.9 | 51.1 | 47.6 | 59.5 | +3.6 |

Figure 4. MKNN scores of the Worst-5 and Top-5 heads after the second training stage, for HeRA and the compared methods (LLaVA, VIRAL, JARVIS, ROSS, CMAR).

Fewer Visual Hallucinations

Although mitigating hallucinations is not an explicit objective, HeRA consistently reduces hallucination rates on CHAIR-MSCOCO, AMBER (generative & discriminative), and HallusionBench, while preserving cognition/coverage scores. Aligning topology curbs over-reliance on linguistic priors.

Choice of the Vision Teacher

Self-supervised teachers work best: DINOv2 / DINOv3 deliver strong, consistent gains, while using SigLIP2 (the primary vision encoder) as teacher is mostly ineffective. Base-size DINOv2-B is already sufficient — scaling the teacher to 1B parameters gives no further benefit.

Figure 5. VQA results of HeRA with different teacher vision encoders.
