A novel multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model, enhanced by reinforcement learning for structured reasoning.
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning.
To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
Figure 1. Comparison between Zero-Shot MLLMs, retrieval-augmented models, and ReAG.
ReAG combines a multi-level retriever, a critic for noise filtering, and a generator trained via SFT + GRPO-inspired RL to reason over retrieved evidence before producing the final answer.
Multi-level embedding-based retrieval against a Wikipedia-scale knowledge base, combining coarse and fine-grained strategies for high recall.
A critic model classifies each retrieved passage as relevant or irrelevant, drastically reducing noise before generation.
Generator trained via supervised fine-tuning on synthetic reasoning traces to acquire the structured <think>/<answer> format as a cold start.
Reinforcement learning with task-specific accuracy and format rewards enables free exploration of diverse reasoning trajectories.
Figure 2. Overview of the proposed ReAG model. Multi-level retriever → critic filtering → SFT cold-start → RL training.
ReAG's retrieval is agnostic to the backbone, supporting any cross-modal encoder. Two complementary strategies ensure high recall while enabling subsequent precision tuning.
Full query image encoded against all document metadata/images. Top-k = 20 documents retained. High recall, lower precision.
spaCy noun-phrase extraction + GroundingDINO localization. Cropped region re-encoded for entity-specific retrieval.
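The two retrieval levels above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine top-k search is done here with plain NumPy, the function names (`top_k_retrieval`, `multi_level_retrieve`) are our own, and the entity crops are assumed to be already extracted and encoded (in ReAG they would come from spaCy noun phrases localized by GroundingDINO and re-encoded by the cross-modal backbone).

```python
import numpy as np

def top_k_retrieval(query_emb, doc_embs, k=20):
    """Coarse retrieval: cosine similarity of one query embedding
    against all document embeddings; return the top-k document indices."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

def multi_level_retrieve(full_img_emb, crop_embs, doc_embs, k=20):
    """Union of coarse (full-image) and fine-grained (entity-crop)
    retrieval results, preserving first-seen order.

    `crop_embs` stands in for the re-encoded GroundingDINO crops;
    here they are simply given as precomputed vectors."""
    candidates = list(top_k_retrieval(full_img_emb, doc_embs, k))
    for crop_emb in crop_embs:
        for idx in top_k_retrieval(crop_emb, doc_embs, k):
            if idx not in candidates:
                candidates.append(idx)
    return candidates
```

The union keeps recall high: the coarse pass covers scene-level matches, while each crop pass can surface entity-specific documents the full image misses. Precision is then recovered downstream by the critic.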
ReAG employs a two-phase strategy: SFT instills basic structured reasoning behavior, while RL unlocks free exploration of diverse reasoning trajectories without the constraint of a KL penalty.
Qwen2.5-VL-3B fine-tuned on 1M passages (balanced E-VQA / InfoSeek), with 30% soft negatives and 70% hard negatives for robust discrimination.
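The 30%/70% negative mix used to train the critic can be sketched as a batch-assembly routine. This is an illustrative sketch, not the released training code: `build_critic_batch` is a hypothetical helper, and passages are treated as opaque items labeled 1 (relevant) or 0 (irrelevant).

```python
import random

def build_critic_batch(positives, soft_negatives, hard_negatives,
                       n_neg, soft_ratio=0.3, seed=0):
    """Assemble critic training pairs with the paper's negative mix:
    ~30% soft negatives (easy distractors) and ~70% hard negatives
    (near-misses) per batch, labeled 1 for relevant / 0 for irrelevant."""
    rng = random.Random(seed)
    n_soft = round(n_neg * soft_ratio)
    negs = (rng.sample(soft_negatives, n_soft)
            + rng.sample(hard_negatives, n_neg - n_soft))
    batch = [(p, 1) for p in positives] + [(n, 0) for n in negs]
    rng.shuffle(batch)
    return batch
```

Hard negatives dominate the mix so the critic learns to reject passages that are topically close but do not actually support the answer, which is where retrieval noise hurts generation most.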
Generator exposed to synthetic reasoning traces collected from Qwen2.5-VL-7B. Structured <think>…</think> <answer>…</answer> format enforced. Loss balances answer (α=0.8) and trace.
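The structured output format and the weighted SFT objective can be sketched as below. This is a simplified illustration under our own naming: `parse_structured_output` and `sft_loss` are hypothetical helpers, and the two loss terms are assumed to be precomputed per-token cross-entropy averages over the answer span and the trace span respectively.

```python
import re

# Enforced output format: a reasoning trace followed by the final answer.
THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                          re.DOTALL)

def parse_structured_output(text):
    """Extract (reasoning trace, answer) from the enforced
    <think>...</think><answer>...</answer> format; None if malformed."""
    m = THINK_ANSWER.search(text)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

def sft_loss(answer_loss, trace_loss, alpha=0.8):
    """Cold-start SFT objective: weighted sum of the answer-token loss
    (alpha = 0.8 in the paper) and the reasoning-trace loss."""
    return alpha * answer_loss + (1.0 - alpha) * trace_loss
```

Weighting the answer span more heavily (α = 0.8) keeps the cold start focused on correctness while still teaching the model to emit a well-formed trace.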
Custom GRPO objective with task-specific accuracy reward (γ=1.0) and format reward (δ=0.2). No KL divergence penalty — enables free exploration. Token-level loss normalization for long-sequence stability.
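The reward scheme and the group-relative advantage at the core of the GRPO-style update can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: exact string match stands in for the benchmark's accuracy metric, and the function names are our own.

```python
import re
import statistics

FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$",
                       re.DOTALL)

def reward(output, gold, gamma=1.0, delta=0.2):
    """Task reward = gamma * accuracy + delta * format compliance
    (gamma = 1.0, delta = 0.2 in the paper). Exact match is a
    simplification of the benchmark accuracy metric."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    acc = float(pred == gold.strip().lower())
    fmt = float(FORMAT_RE.match(output.strip()) is not None)
    return gamma * acc + delta * fmt

def group_advantages(rewards):
    """GRPO-style advantage: each sampled completion's reward is
    normalized by the mean and std of its group. Note the objective
    adds no KL penalty term, allowing free exploration."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]
```

With group-relative advantages, completions that answer correctly in the required format get positively reinforced relative to their siblings, without anchoring the policy to the SFT distribution via a KL term.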
Figure 5. Task-specific accuracy reward progression across training iterations of the ReAG 7B generator.
ReAG achieves consistent improvements across both benchmarks and model scales, outperforming prior retrieval-augmented methods by a significant margin.
| Model | Generator | E-VQA Single-Hop | E-VQA All | InfoSeek Un-Q | InfoSeek Un-E | InfoSeek All |
|---|---|---|---|---|---|---|
| *Zero-shot baselines* | | | | | | |
| Qwen2.5-VL-3B (ZS) | — | 21.9 | 21.9 | 18.9 | 17.7 | 18.3 |
| Qwen2.5-VL-7B (ZS) | — | 23.6 | 23.2 | 22.8 | 24.1 | 23.7 |
| *Retrieval-augmented* | | | | | | |
| ReflectiVA | Qwen2.5-VL-3B | 33.7 | 35.2 | 39.6 | 38.1 | 38.9 |
| VLM-PRF | Qwen2.5-VL-3B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| ReAG (Ours) | Qwen2.5-VL-3B | 41.3 (+7.6) | 42.9 (+7.7) | 43.7 (+4.0) | 42.9 (+4.1) | 43.3 (+4.3) |
| ReflectiVA | Qwen2.5-VL-7B | 36.8 | 36.8 | 43.5 | 44.3 | 43.9 |
| VLM-PRF | Qwen2.5-VL-7B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| ReAG (Ours) | Qwen2.5-VL-7B | 44.9 (+4.8) | 47.0 (+7.8) | 48.3 (+4.8) | 46.2 (+1.9) | 47.2 (+3.3) |
Table 1. VQA accuracy on the Encyclopedic-VQA test set and the InfoSeek validation set. Deltas next to ReAG scores indicate the improvement over the best prior method at the same model scale.
ReAG generates explicit, structured reasoning traces that ground the answer in both visual evidence and retrieved passages, providing full explainability.
Q: Who designed this dock?
We propose ReAG, a novel reasoning-augmented multimodal RAG model that combines coarse- and fine-grained retrieval with a critic model to improve precision and reduce noise. The critic is agnostic to the retrieval backbone.
ReAG trains the generator with a multi-stage strategy: SFT cold start followed by a GRPO-inspired reinforcement learning framework with a reward scheme specifically designed for KB-VQA (accuracy + format rewards).
We empirically validate ReAG on Encyclopedic-VQA and InfoSeek, reaching a new state-of-the-art on both benchmarks across model scales, with fully open-source code available on GitHub.
@inproceedings{compagnoni2026reag,
title = {{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
author = {Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}