| Budget | Generator | Steps | Verifier | Best-of-N | Single | Two | Counting | Color | Position | Attribution | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200ms | SANA-Sprint | 1 | – | Best-of-1 | 99.3 | 88.1 | 56.0 | 87.6 | 54.1 | 47.8 | 71.6 |
| 550ms | SANA-1.5 | 4 | – | Best-of-1 | 98.8 | 78.2 | 66.5 | 71.1 | 50.6 | 20.8 | 63.0 |
| | SANA-Sprint | 8 | – | Best-of-1 | 99.5 | 91.9 | 59.3 | 86.0 | 57.8 | 52.4 | 74.0 |
| | SANA-Sprint | 1 | MLLM w/ CLIP | Best-of-2 | 100.0 | 91.3 | 59.5 | 88.0 | 61.0 | 55.4 | 75.4 |
| | SANA-Sprint | 1 | MLLM w/ AE | Best-of-3 | 100.0 | 90.9 | 59.0 | 89.6 | 55.8 | 50.6 | 73.1 |
| | SANA-Sprint | 1 | VHS (Ours) | Best-of-4 | 100.0 | 93.9 | 61.5 | 90.6 | 66.2 | 58.4 | 78.1 |
| 1100ms | SANA-1.5 | 12 | – | Best-of-1 | 100.0 | 92.7 | 74.8 | 88.3 | 61.4 | 59.6 | 78.8 |
| | SANA-Sprint | 20 | – | Best-of-1 | 100.0 | 88.5 | 59.8 | 89.6 | 48.6 | 51.0 | 72.2 |
| | SANA-Sprint | 1 | MLLM w/ CLIP | Best-of-4 | 100.0 | 92.7 | 66.0 | 88.9 | 65.9 | 61.6 | 78.8 |
| | SANA-Sprint | 1 | MLLM w/ AE | Best-of-7 | 99.7 | 90.7 | 61.3 | 90.8 | 59.6 | 49.3 | 74.7 |
| | SANA-Sprint | 1 | VHS (Ours) | Best-of-9 | 100.0 | 95.7 | 66.5 | 88.9 | 69.8 | 63.8 | 80.5 |
| 1650ms | SANA-1.5 | 16 | – | Best-of-1 | 99.7 | 93.5 | 77.3 | 89.1 | 60.2 | 60.8 | 79.4 |
| | SANA-Sprint | 30 | – | Best-of-1 | 100.0 | 90.5 | 57.3 | 85.1 | 49.3 | 50.2 | 71.4 |
| | SANA-Sprint | 1 | MLLM w/ CLIP | Best-of-6 | 100.0 | 93.9 | 68.2 | 88.7 | 69.8 | 64.2 | 80.4 |
| | SANA-Sprint | 1 | MLLM w/ AE | Best-of-11 | 99.7 | 90.5 | 59.3 | 89.8 | 58.4 | 49.0 | 73.9 |
| | SANA-Sprint | 1 | VHS (Ours) | Best-of-15 | 100.0 | 96.0 | 67.3 | 89.1 | 70.4 | 64.6 | 80.9 |

Table 1. Accuracy (%) on the GenEval benchmark across computational budgets, generator backbones, and verifier configurations (with Qwen2.5-0.5B as the LLM). Results compare SANA-1.5 and SANA-Sprint under matched wall-clock budgets (in milliseconds); each verifier operates under the same time constraint via adaptive Best-of-N, so a verifier with a lower per-candidate cost can score more candidates within the same budget.
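
The budget-matched Best-of-N setup in Table 1 can be illustrated with a minimal sketch. It assumes that each candidate costs one generation pass plus one verifier pass, and that N is the largest count fitting in the budget. The function names `max_candidates` and `best_of_n`, and the timings in the usage lines, are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def max_candidates(budget_ms: float, gen_ms: float, verify_ms: float) -> int:
    """Largest N with N * (gen_ms + verify_ms) <= budget_ms, never below 1.

    Assumption: candidates are generated and scored sequentially, so the
    per-candidate cost is simply generation time plus verification time.
    """
    per_candidate_ms = gen_ms + verify_ms
    if per_candidate_ms <= 0:
        raise ValueError("per-candidate cost must be positive")
    return max(1, int(budget_ms // per_candidate_ms))


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical timings, for illustration only (not measurements from the paper).
n_fast = max_candidates(budget_ms=550, gen_ms=50, verify_ms=60)
n_slow = max_candidates(budget_ms=550, gen_ms=50, verify_ms=200)

# Toy "verifier": string length stands in for a learned score.
picked = best_of_n(["a", "bbb", "cc"], score=len)
```

Under these assumed timings the cheaper verifier fits more candidates into the same budget, which matches the pattern in the table where the Best-of-N column differs across verifiers at a fixed budget.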