Main results
Q-Zoom is competitive with, and often surpasses, recent
detail-zoom and tool-use baselines across five backbones
(LLaVA-1.5-7B/13B, Qwen2.5-VL-3B/7B, Qwen3-VL-4B).
Accuracy / throughput trade-off
Sweeping the visual-token budget from 256 to 4096 traces a Pareto curve. Q-Zoom (orange)
shifts the curve up and to the right of the baseline VLM (gray) on both
Doc/OCR-heavy and high-resolution vision benchmarks: at matched accuracy it is multiple
times faster, and at matched throughput it lifts average accuracy by several points.
The trend is consistent across both supported backbones.
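As a rough illustration of how such a curve can be extracted from a budget sweep, the sketch below keeps only the non-dominated (throughput, accuracy) points. The function name `pareto_front` and the tuple layout are assumptions made for this example and are not taken from the Q-Zoom code; both axes are treated as higher-is-better.

```python
from typing import List, Tuple

# One sweep result: (throughput, accuracy). Both are "higher is better".
# This layout is an assumption for illustration, not the project's format.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes."""
    # Visit points from highest to lowest throughput; on ties, the most
    # accurate point comes first so its duplicates are skipped.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    front: List[Point] = []
    best_accuracy = float("-inf")
    for throughput, accuracy in ordered:
        # A point survives only if it is more accurate than every
        # point that is at least as fast.
        if accuracy > best_accuracy:
            front.append((throughput, accuracy))
            best_accuracy = accuracy
    # Return in ascending throughput order, ready for plotting.
    return sorted(front)
```

Calling `pareto_front` once on the baseline sweep and once on the Q-Zoom sweep would give the two curves to compare; a curve that lies above and to the right of the other dominates it at every matched operating point.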
The dynamic gate in action
Two real samples processed by the same Q-Zoom-Qwen2.5-VL-7B checkpoint at the
canonical resolution. On the left (InfoVQA), the gating head fires
(score 0.984) → SD-RPN finds the small "Wages per day" bar chart
→ re-decoding the crop reads the Ecuador and Panama bars and sums them. On the
right (TextVQA), the gating head correctly skips
(score 0.000) because the answer is evident from the global view, and the model
answers directly without any extra compute.
Note: the SD-RPN heatmap and gate scores shown here were produced with a
cap of 576 visual tokens per image.
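The control flow described above (gate, then optional region proposal and re-decoding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Q-Zoom implementation: the callables `gate`, `propose_region`, `crop` and `decode`, the `GatedResult` container, and the 0.5 threshold are all hypothetical placeholders, and the sketch assumes the second pass decodes the crop alone.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (left, top, right, bottom)


@dataclass
class GatedResult:
    answer: str
    gate_score: float
    zoomed: bool
    box: Optional[Box] = None


def gated_answer(
    image: Any,
    question: str,
    *,
    gate: Callable[[Any, str], float],          # stands in for the gating head
    propose_region: Callable[[Any, str], Box],  # stands in for SD-RPN
    crop: Callable[[Any, Box], Any],
    decode: Callable[[Any, str], str],          # stands in for the VLM decoder
    threshold: float = 0.5,                     # assumed value, not from the source
) -> GatedResult:
    """Answer directly when the gate skips; otherwise zoom in and re-decode."""
    score = gate(image, question)
    if score < threshold:
        # Gate skips: a single pass over the full image, no extra compute.
        return GatedResult(decode(image, question), score, zoomed=False)
    # Gate fires: localize the relevant region and decode the crop.
    box = propose_region(image, question)
    return GatedResult(decode(crop(image, box), question), score, True, box)
```

With stub callables, a gate score of 0.984 takes the zoom branch and a score of 0.000 takes the direct branch, mirroring the two samples shown.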