A novel multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model, enhanced by reinforcement learning for structured reasoning.
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning.
To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
Figure 1. Comparison between Zero-Shot MLLMs, retrieval-augmented models, and ReAG.
ReAG combines a multi-level retriever, a critic for noise filtering, and a generator trained via SFT + GRPO-inspired RL to reason over retrieved evidence before producing the final answer.
Multi-level embedding-based retrieval against a Wikipedia-scale knowledge base, combining coarse and fine-grained strategies for high recall.
A critic model classifies each retrieved passage as relevant or irrelevant, drastically reducing noise before generation.
Generator trained via supervised fine-tuning on synthetic reasoning traces to acquire the structured <think>/<answer> format as a cold start.
Reinforcement learning with task-specific accuracy and format rewards enables free exploration of diverse reasoning trajectories.
Figure 2. Overview of the proposed ReAG model. Multi-level retriever → critic filtering → SFT cold-start → RL training.
ReAG's retrieval is agnostic to the backbone, supporting any cross-modal encoder. Two complementary strategies ensure high recall while enabling subsequent precision tuning.
Full query image encoded against all document metadata/images. Top-k = 20 documents retained. High recall, lower precision.
spaCy noun-phrase extraction + GroundingDINO localization. Cropped region re-encoded for entity-specific retrieval.
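The two retrieval levels above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine top-k search is done here with plain NumPy, the function names (`top_k_retrieval`, `multi_level_retrieve`) are our own, and the entity crops are assumed to be already extracted and encoded (in ReAG they would come from spaCy noun phrases localized by GroundingDINO and re-encoded by the cross-modal backbone).

```python
import numpy as np

def top_k_retrieval(query_emb, doc_embs, k=20):
    """Coarse retrieval: cosine similarity of one query embedding
    against all document embeddings; return the top-k document indices."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

def multi_level_retrieve(full_img_emb, crop_embs, doc_embs, k=20):
    """Union of coarse (full-image) and fine-grained (entity-crop)
    retrieval results, preserving first-seen order.

    `crop_embs` stands in for the re-encoded GroundingDINO crops;
    here they are simply given as precomputed vectors."""
    candidates = list(top_k_retrieval(full_img_emb, doc_embs, k))
    for crop_emb in crop_embs:
        for idx in top_k_retrieval(crop_emb, doc_embs, k):
            if idx not in candidates:
                candidates.append(idx)
    return candidates
```

The union keeps recall high: the coarse pass covers scene-level matches, while each crop pass can surface entity-specific documents the full image misses. Precision is then recovered downstream by the critic.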
ReAG employs a two-phase strategy: SFT instills basic structured reasoning behavior, while RL unlocks free exploration of diverse reasoning trajectories without the constraint of a KL penalty.
Qwen2.5-VL-3B fine-tuned on 1M passages (balanced E-VQA / InfoSeek), with 30% soft negatives and 70% hard negatives for robust discrimination.
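The 30%/70% negative mix used to train the critic can be sketched as a batch-assembly routine. This is an illustrative sketch, not the released training code: `build_critic_batch` is a hypothetical helper, and passages are treated as opaque items labeled 1 (relevant) or 0 (irrelevant).

```python
import random

def build_critic_batch(positives, soft_negatives, hard_negatives,
                       n_neg, soft_ratio=0.3, seed=0):
    """Assemble critic training pairs with the paper's negative mix:
    ~30% soft negatives (easy distractors) and ~70% hard negatives
    (near-misses) per batch, labeled 1 for relevant / 0 for irrelevant."""
    rng = random.Random(seed)
    n_soft = round(n_neg * soft_ratio)
    negs = (rng.sample(soft_negatives, n_soft)
            + rng.sample(hard_negatives, n_neg - n_soft))
    batch = [(p, 1) for p in positives] + [(n, 0) for n in negs]
    rng.shuffle(batch)
    return batch
```

Hard negatives dominate the mix so the critic learns to reject passages that are topically close but do not actually support the answer, which is where retrieval noise hurts generation most.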
Generator exposed to synthetic reasoning traces collected from Qwen2.5-VL-7B. Structured <think>…</think> <answer>…</answer> format enforced. Loss balances answer (α=0.8) and trace.
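The structured output format and the weighted SFT objective can be sketched as below. This is a simplified illustration under our own naming: `parse_structured_output` and `sft_loss` are hypothetical helpers, and the two loss terms are assumed to be precomputed per-token cross-entropy averages over the answer span and the trace span respectively.

```python
import re

# Enforced output format: a reasoning trace followed by the final answer.
THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                          re.DOTALL)

def parse_structured_output(text):
    """Extract (reasoning trace, answer) from the enforced
    <think>...</think><answer>...</answer> format; None if malformed."""
    m = THINK_ANSWER.search(text)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

def sft_loss(answer_loss, trace_loss, alpha=0.8):
    """Cold-start SFT objective: weighted sum of the answer-token loss
    (alpha = 0.8 in the paper) and the reasoning-trace loss."""
    return alpha * answer_loss + (1.0 - alpha) * trace_loss
```

Weighting the answer span more heavily (α = 0.8) keeps the cold start focused on correctness while still teaching the model to emit a well-formed trace.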
Custom GRPO objective with task-specific accuracy reward (γ=1.0) and format reward (δ=0.2). No KL divergence penalty — enables free exploration. Token-level loss normalization for long-sequence stability.
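The reward scheme and the group-relative advantage at the core of the GRPO-style update can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: exact string match stands in for the benchmark's accuracy metric, and the function names are our own.

```python
import re
import statistics

FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$",
                       re.DOTALL)

def reward(output, gold, gamma=1.0, delta=0.2):
    """Task reward = gamma * accuracy + delta * format compliance
    (gamma = 1.0, delta = 0.2 in the paper). Exact match is a
    simplification of the benchmark accuracy metric."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    acc = float(pred == gold.strip().lower())
    fmt = float(FORMAT_RE.match(output.strip()) is not None)
    return gamma * acc + delta * fmt

def group_advantages(rewards):
    """GRPO-style advantage: each sampled completion's reward is
    normalized by the mean and std of its group. Note the objective
    adds no KL penalty term, allowing free exploration."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]
```

With group-relative advantages, completions that answer correctly in the required format get positively reinforced relative to their siblings, without anchoring the policy to the SFT distribution via a KL term.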
Figure 5. Task-specific accuracy reward progression across training iterations of the ReAG 7B generator.
ReAG achieves consistent improvements across both benchmarks and model scales, outperforming prior retrieval-augmented methods by a significant margin.
| Model | Generator | E-VQA Single-Hop | E-VQA All | InfoSeek Un-Q | InfoSeek Un-E | InfoSeek All |
|---|---|---|---|---|---|---|
| *Zero-shot baselines* | | | | | | |
| Qwen2.5-VL-3B (ZS) | — | 21.9 | 21.9 | 18.9 | 17.7 | 18.3 |
| Qwen2.5-VL-7B (ZS) | — | 23.6 | 23.2 | 22.8 | 24.1 | 23.7 |
| *Retrieval-augmented* | | | | | | |
| ReflectiVA | Qwen2.5-VL-3B | 33.7 | 35.2 | 39.6 | 38.1 | 38.9 |
| VLM-PRF | Qwen2.5-VL-3B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| ReAG (Ours) | Qwen2.5-VL-3B | 41.3 (+7.6) | 42.9 (+7.7) | 43.7 (+4.0) | 42.9 (+4.1) | 43.3 (+4.3) |
| ReflectiVA | Qwen2.5-VL-7B | 36.8 | 36.8 | 43.5 | 44.3 | 43.9 |
| VLM-PRF | Qwen2.5-VL-7B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| ReAG (Ours) | Qwen2.5-VL-7B | 44.9 (+4.8) | 47.0 (+7.8) | 48.3 (+4.8) | 46.2 (+1.9) | 47.2 (+3.3) |
Table 1. VQA accuracy on the Encyclopedic-VQA test set and the InfoSeek validation set. Deltas next to ReAG scores indicate the improvement over the best prior method at the same model scale.
ReAG generates explicit, structured reasoning traces that ground the answer in both visual evidence and retrieved passages, providing full explainability.
Q: Who designed this dock?
We propose ReAG, a novel reasoning-augmented multimodal RAG model that combines coarse- and fine-grained retrieval with a critic model to improve precision and reduce noise. The critic is agnostic to the retrieval backbone.
ReAG trains the generator with a multi-stage strategy: SFT cold start followed by a GRPO-inspired reinforcement learning framework with a reward scheme specifically designed for KB-VQA (accuracy + format rewards).
We empirically validate ReAG on Encyclopedic-VQA and InfoSeek, reaching a new state-of-the-art on both benchmarks across model scales, with fully open-source code available on GitHub.
@inproceedings{compagnoni2026reag,
title = {{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
author = {Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}