Abstract
Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations.
In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors.
We show that, under tiny inference budgets with only a few candidates per prompt, VHS enables more efficient inference-time scaling: relative to a standard MLLM verifier, it reduces joint generation-and-verification time by 63.3%, FLOPs by 51%, and VRAM usage by 14.5%, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
Verification in Latent Space
Standard MLLM verifiers score a generated image by decoding the latent z₀ to pixel space, then re-encoding it through a visual backbone (e.g., CLIP) before passing it to an LLM. VHS eliminates this pipeline entirely.
Instead, it aligns DiT hidden states hℓ* directly with the LLM input space, so generator features can be scored without ever decoding to pixels. This cuts the per-candidate verification cost while matching or improving on MLLM-based competitors.
By accessing hidden states at layer ℓ*, VHS also truncates the generator early, skipping the remaining DiT layers during verification. Only the winning candidate is propagated through the full network to produce the final image.
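The truncated-generation idea can be sketched with a toy numpy model. All names and sizes below (DIT_LAYERS, L_STAR, the linear "blocks" and scoring head) are illustrative stand-ins, not the paper's architecture: the point is that each candidate only runs the first ℓ* blocks and is scored on its hidden state, and only the winner pays for the full network.

```python
import numpy as np

rng = np.random.default_rng(0)
DIT_LAYERS, HIDDEN, L_STAR = 8, 32, 4  # illustrative sizes, not the paper's
W = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(DIT_LAYERS)]
connector = rng.standard_normal((HIDDEN, 16)) * 0.1  # DiT features -> LLM input space
head = rng.standard_normal(16)                       # stand-in for the LLM scorer

def forward(z, layers):
    # Run the first `layers` toy DiT blocks on latent z.
    h = z
    for w in W[:layers]:
        h = np.tanh(h @ w)
    return h

def score(z):
    # Truncated forward: only the first L_STAR blocks run per candidate,
    # and the hidden state is scored directly -- no VAE decode, no CLIP re-encode.
    return float((forward(z, L_STAR) @ connector @ head).mean())

def best_of_n(candidates):
    scores = [score(z) for z in candidates]
    winner = candidates[int(np.argmax(scores))]
    # Only the winning candidate is propagated through the full network.
    return forward(winner, DIT_LAYERS), scores

cands = [rng.standard_normal((4, HIDDEN)) for _ in range(3)]
out, scores = best_of_n(cands)
```

With N candidates, the verification pass costs roughly N · (ℓ*/L) of a full forward instead of N full forwards plus N decode–re-encode round trips.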
Alignment
Adapts DiT hidden features to the LLM's input space using re-captioned synthetic image–caption pairs, following the LLaVA training scheme. Only the connector is trained.
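A minimal sketch of the connector-only alignment stage, assuming frozen DiT features on one side and frozen LLM caption embeddings on the other. The closed-form least-squares fit below is an illustrative stand-in for the LLaVA-style training loop; the sizes and the linear connector are hypothetical.

```python
import numpy as np

# Toy alignment: DiT and LLM stay frozen, only the connector is fit to map
# DiT hidden features onto LLM input embeddings (illustrative sizes).
rng = np.random.default_rng(1)
HIDDEN, LLM_DIM, N = 32, 16, 256

dit_features = rng.standard_normal((N, HIDDEN))    # frozen generator features
target_embeds = rng.standard_normal((N, LLM_DIM))  # frozen LLM caption embeddings

# Train only the connector: solve min_W ||dit_features @ W - target_embeds||^2
W, *_ = np.linalg.lstsq(dit_features, target_embeds, rcond=None)

aligned = dit_features @ W
mse = float(np.mean((aligned - target_embeds) ** 2))
```

Keeping both backbones frozen means the alignment stage only learns the small connector, which is what makes this step cheap relative to full fine-tuning.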
Verifier Fine-tuning
Specializes the model on binary (Yes/No) quality labels derived from 118k generated samples categorized by GenEval criteria. Weighted cross-entropy addresses class imbalance.
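The class-imbalance handling can be illustrated with a weighted binary cross-entropy, where the rare class is up-weighted by the inverse class frequency. The labels, logits, and weighting scheme below are toy values for illustration, not the paper's training data.

```python
import numpy as np

def weighted_bce(logits, labels, pos_weight):
    # Weighted binary cross-entropy over Yes(1)/No(0) quality labels.
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, 1e-7, 1 - 1e-7)
    # Up-weight the rarer positive class so it is not drowned out.
    losses = -(pos_weight * labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(losses.mean())

labels = np.array([1.0, 0.0, 0.0, 0.0])  # imbalanced toy batch: one "Yes" in four
logits = np.array([0.2, -1.0, -0.5, -2.0])
pos_weight = (labels == 0).sum() / (labels == 1).sum()  # inverse-frequency weight

loss = weighted_bce(logits, labels, pos_weight)
```

With inverse-frequency weighting, a misclassified rare positive contributes as much to the loss as several misclassified negatives.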
Qualitative Results
Visual comparison of the best-picked images by different verifiers, for images generated by SANA-Sprint on GenEval prompts. ✓ and × indicate agreement with the GenEval verifier.
Quantitative Results
We evaluate on GenEval across three wall-clock budgets. At each budget, VHS evaluates more candidates than MLLM w/ CLIP thanks to its 63.3% lower verification latency. This allows VHS to achieve higher final generation quality, improving GenEval scores by +2.7% at the same time budget.
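The latency-aware setting can be sketched as a budgeted best-of-N loop: each verifier has a fixed per-candidate cost, and candidates are drawn until the wall-clock budget is spent. The costs below are illustrative, chosen only so that the cheaper verifier is ~63% cheaper per candidate, mirroring the latency reduction reported above; the random scores stand in for real verifier outputs.

```python
import numpy as np

rng = np.random.default_rng(2)

def scale_under_budget(budget, cost_per_candidate):
    # Draw and score candidates until the wall-clock budget is exhausted.
    best, spent, n = -np.inf, 0.0, 0
    while spent + cost_per_candidate <= budget:
        score = float(rng.standard_normal())  # stand-in for a verifier score
        best = max(best, score)
        spent += cost_per_candidate
        n += 1
    return best, n

BUDGET = 10.0  # arbitrary time units
_, n_mllm = scale_under_budget(BUDGET, cost_per_candidate=1.0)
_, n_vhs = scale_under_budget(BUDGET, cost_per_candidate=0.367)  # ~63.3% cheaper
```

At a fixed budget, the cheaper verifier evaluates roughly 2.7x more candidates, which is the mechanism behind the higher final GenEval scores at the same time budget.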
Key Takeaways
Latent Verifier
VHS directly aligns DiT hidden states hℓ* with an LLM input space, eliminating the pixel-space decode–encode roundtrip entirely.
Wall-Clock Scaling
We define a latency-aware inference-time scaling setting that explicitly measures wall-clock time rather than function evaluations alone.
Thorough Ablations
We study layer depth, LLM backbone, loss function, and training data, exposing key latency–accuracy trade-offs at tiny compute budgets.
BibTeX
@inproceedings{bucciarelli2026vhs,
title = {Tiny Inference-Time Scaling with Latent Verifiers},
author = {Bucciarelli, Davide and Turri, Evelyn and Baraldi, Lorenzo and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}