CVPR 2026 · Accepted

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

A novel multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model, enhanced by reinforcement learning for structured reasoning.

Alberto Compagnoni*,1,2 Marco Morini*,1 Sara Sarto1 Federico Cocchi1,2 Davide Caffagni1 Marcella Cornia1 Lorenzo Baraldi1 Rita Cucchiara1
1University of Modena and Reggio Emilia 2University of Pisa
*Equal contribution
arXiv Paper · GitHub Code · AImageLab

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning.


To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.

Figure 1. Comparison between Zero-Shot MLLMs, retrieval-augmented models, and ReAG.

[Headline statistics: VQA Accuracy (InfoSeek · Qwen2.5-VL-7B) · E-VQA Single-Hop (Encyclopedic-VQA · 7B) · Gain vs Previous Works (E-VQA · 7B backbone) · Oracle Setting (E-VQA Single-Hop · 7B) · Noise Reduction (passages after filtering)]

A Four-Stage Pipeline

ReAG combines a multi-level retriever, a critic for noise filtering, and a generator trained via SFT + GRPO-inspired RL to reason over retrieved evidence before producing the final answer.

01
Retrieval

Multi-level embedding-based retrieval against a Wikipedia-scale knowledge base, combining coarse and fine-grained strategies for high recall.

02
Filtering

A critic model classifies each retrieved passage as relevant or irrelevant, drastically reducing noise before generation.

03
SFT

Generator trained via supervised fine-tuning on synthetic reasoning traces to acquire the structured <think>/<answer> format as a cold start.

04
GRPO

Reinforcement learning with task-specific accuracy and format rewards enables free exploration of diverse reasoning trajectories.
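The four stages above can be sketched end-to-end. This is a toy illustration with stand-in components; the real retriever, critic, and generator interfaces are not specified on this page, so every function here is a hypothetical placeholder.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    score: float

def retrieve_coarse(kb, top_k=20):
    # 01 Retrieval: keep the top-k highest-scoring documents (high recall)
    return sorted(kb, key=lambda p: p.score, reverse=True)[:top_k]

def critic_is_relevant(question, passage):
    # 02 Filtering: toy relevance check -- the last word of the question
    # must appear in the passage (the real critic is a fine-tuned MLLM)
    return question.lower().split()[-1].rstrip("?") in passage.text.lower()

def gather_context(question, kb, top_k=20):
    passages = retrieve_coarse(kb, top_k=top_k)
    kept = [p for p in passages if critic_is_relevant(question, p)]
    # 03/04: the SFT + GRPO-trained generator would then reason over `kept`
    # and emit "<think>...</think> <answer>...</answer>"
    return kept
```

The point of the structure is that generation only ever sees passages that survive the critic, which is what keeps the reasoning traces grounded.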

Figure 2. Overview of the proposed ReAG model. Multi-level retriever → critic filtering → SFT cold-start → RL training.

Multi-Level Retrieval

ReAG's retrieval is agnostic to the backbone, supporting any cross-modal encoder. Two complementary strategies ensure high recall, leaving precision to the subsequent critic filtering stage.

🔍 Coarse-Grained

Full query image encoded against all document metadata/images. Top-k = 20 documents retained. High recall, lower precision.

🎯 Fine-Grained

spaCy noun-phrase extraction + GroundingDINO localization. Cropped region re-encoded for entity-specific retrieval.

w/o Filtering: 128.6 passages
ReflectiVA: ~7 passages
ReAG: 5.7 passages
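The coarse-grained step is a standard embedding search, which can be sketched in a few lines. The function name and the use of plain cosine similarity are assumptions; the page only states that any cross-modal encoder can supply the embeddings.

```python
import numpy as np

def top_k_docs(query_emb, doc_embs, k=20):
    """Coarse-grained retrieval: cosine similarity between one query
    embedding and every document embedding; keep the k most similar
    (top-k = 20 as described above)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]
```

The fine-grained pass would re-run the same search, but with the embedding of the GroundingDINO crop corresponding to a spaCy-extracted noun phrase instead of the full query image.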

Multi-Stage Learning

ReAG employs a two-phase strategy: SFT instills basic structured reasoning behavior, while RL unlocks free exploration of diverse reasoning trajectories without the constraint of a KL penalty.

Critic

Critic Training

Qwen2.5-VL-3B fine-tuned on 1M passages (balanced E-VQA / InfoSeek), with 30% soft negatives and 70% hard negatives for robust discrimination.
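The 30/70 negative mix could be assembled as below. The sampling scheme and function name are assumptions; only the ratio comes from the text.

```python
import random

def sample_negatives(soft_pool, hard_pool, n, soft_frac=0.30, seed=0):
    """Mix negatives for critic training: 30% soft (loosely related)
    and 70% hard (near-miss) negatives. The concrete sampling scheme
    is a hypothetical sketch; only the ratio is stated on this page."""
    rng = random.Random(seed)
    n_soft = round(n * soft_frac)
    return rng.sample(soft_pool, n_soft) + rng.sample(hard_pool, n - n_soft)
```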

SFT

Cold-Start SFT

Generator exposed to synthetic reasoning traces collected from Qwen2.5-VL-7B. Structured <think>…</think> <answer>…</answer> format enforced. Loss balances answer (α=0.8) and trace.
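One plausible reading of "loss balances answer (α=0.8) and trace" is a convex combination of the two negative log-likelihoods; the exact combination used in the paper is an assumption.

```python
def sft_loss(answer_nll, trace_nll, alpha=0.8):
    # Weighted SFT objective: alpha weights the answer tokens' loss,
    # (1 - alpha) weights the reasoning-trace tokens' loss.
    # This exact form is an assumed reading of the description above.
    return alpha * answer_nll + (1.0 - alpha) * trace_nll
```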

RL

GRPO Reinforcement Learning

Custom GRPO objective with task-specific accuracy reward (γ=1.0) and format reward (δ=0.2). No KL divergence penalty — enables free exploration. Token-level loss normalization for long-sequence stability.
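The reward and advantage computation can be sketched as follows. Exact-match accuracy is an assumed stand-in for the task-specific metric; the γ and δ weights and the absence of a KL term come from the description above.

```python
import numpy as np

def reward(pred, gold, well_formatted, gamma=1.0, delta=0.2):
    # Total reward = accuracy term (γ=1.0) + format term (δ=0.2).
    # Exact match is an assumed proxy for the task-specific accuracy reward.
    acc = float(pred.strip().lower() == gold.strip().lower())
    return gamma * acc + delta * float(well_formatted)

def group_advantages(rewards):
    # GRPO-style advantage: normalize rewards within one sampled group.
    # No KL penalty appears in the objective; the policy-gradient loss
    # is additionally normalized per token for long-sequence stability.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```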

Figure 5. Task-specific accuracy reward progression across training iterations of the ReAG 7B generator.

State-of-the-Art Performance

ReAG achieves consistent improvements across both benchmarks and model scales, outperforming prior retrieval-augmented methods by a significant margin.

| Model | Generator | E-VQA Single-Hop | E-VQA All | InfoSeek Un-Q | InfoSeek Un-E | InfoSeek All |
| --- | --- | --- | --- | --- | --- | --- |
| **Zero-shot baselines** | | | | | | |
| Qwen2.5-VL-3B (ZS) | | 21.9 | 21.9 | 18.9 | 17.7 | 18.3 |
| Qwen2.5-VL-7B (ZS) | | 23.6 | 23.2 | 22.8 | 24.1 | 23.7 |
| **Retrieval-augmented** | | | | | | |
| ReflectiVA | Qwen2.5-VL-3B | 33.7 | 35.2 | 39.6 | 38.1 | 38.9 |
| VLM-PRF | Qwen2.5-VL-3B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| ReAG (Ours) | Qwen2.5-VL-3B | 41.3 (+7.6) | 42.9 (+7.7) | 43.7 (+4.0) | 42.9 (+4.1) | 43.3 (+4.3) |
| ReflectiVA | Qwen2.5-VL-7B | 36.8 | 36.8 | 43.5 | 44.3 | 43.9 |
| VLM-PRF | Qwen2.5-VL-7B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| ReAG (Ours) | Qwen2.5-VL-7B | 44.9 (+4.8) | 47.0 (+7.8) | 48.3 (+4.8) | 46.2 (+1.9) | 47.2 (+3.3) |

Table 1. VQA accuracy scores on Encyclopedic-VQA test set and InfoSeek validation set. +delta vs best prior method at same scale.

Reasoning Traces in Action

ReAG generates explicit, structured reasoning traces that ground the answer in both visual evidence and retrieved passages, providing full explainability.

Q: What is the closest upper taxonomy of this bird?
ZS: The closest taxonomy is the family Laridae.
Base: Sterna
ReAG: Thalasseus

Q: Which road, railway or canal does this bridge carry?
ZS: The bridge is Blackfriars Bridge.
Base: Road
ReAG: A201

Q: What is the brand of this vehicle?
ZS: The vehicle is a Ford. This can be determined…
Base: Ford
ReAG: Isuzu

Q: Who designed this dock?

ZS: Not enough information is provided to determine who designed the dock.
ReflectiVA: Jesse Hartley
ReAG: <think> The image shows the Albert Dock in Liverpool, which matches the question asked. The question specifically asks for the designer of this particular dock. The retrieved passage directly states that the dock in question, the Albert Dock, was designed by Jesse Hartley. Since the question specifically matches the dock in the image with the one mentioned in the passage, the retrieved passage directly answers the question without any need to refer to the question itself or the retrieved passage. </think>
<answer> Royal Albert Dock ✓ </answer>


BibTeX

@inproceedings{compagnoni2026reag,
  title     = {{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author    = {Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}