
Q-Zoom: Query-Aware Adaptive Perception
for Efficient Multimodal Large Language Models

1The University of Sydney  ·  2Sun Yat-sen University  ·  3City University of Hong Kong
✉ Corresponding author

Introduction

Q-Zoom is a query-aware adaptive perception framework that lets a Multimodal LLM look at high resolution only when it needs to, and only where it needs to — without changing the underlying backbone. Instead of indiscriminately flooding the quadratic self-attention with redundant high-resolution tokens, Q-Zoom reasons in a coarse-to-fine manner using two lightweight modules grafted on top of an off-the-shelf MLLM:

  1. Dynamic Gating Network. A small head decides per query whether the coarse global pass already suffices, and only routes the hard cases through the high-resolution refinement loop. Routing labels come from a consistency-aware sample-generation strategy that filters unstable / contradictory examples before BCE supervision.
  2. Self-Distilled Region Proposal Network (SD-RPN). When detail is needed, SD-RPN localizes the task-relevant Region-of-Interest directly from the MLLM's own intermediate feature space — fully self-supervised, with no external detector and no extra annotation. The cropped region is then re-decoded and fused with the coarse global layout via a continuous spatio-temporal alignment scheme + targeted post-SFT.
  3. Single-pass coarse-to-fine inference. At test time Q-Zoom always emits a direct response from the low-resolution pass; the gating head decides per sample whether to also emit a RoI-based response. The two are fused in a single forward path — no tree search, no RL roll-outs, no external tool calls.

Together, these three pieces let Q-Zoom establish a dominant Pareto frontier: on Qwen2.5-VL-7B it is 2.52× faster on Doc/OCR benchmarks and 4.39× faster on high-resolution benchmarks at matched accuracy, and when configured for maximum fidelity it surpasses the baseline's peak by +1.1% and +8.1% respectively. The recipe transfers seamlessly to Qwen3-VL, LLaVA-1.5, and emerging RL-based thinking-with-image models.


2.52×
Doc/OCR speedup
(Qwen2.5-VL-7B)
4.39×
High-res speedup
(Qwen2.5-VL-7B)
+1.1%
Doc/OCR over baseline peak
+8.1%
High-res over baseline peak

Method

Inference: gated two-pass decoding

At inference, Q-Zoom always emits a direct response from the low-resolution pass; the high-res gating head decides per sample whether to also emit a RoI-based response by re-decoding the cropped region predicted by SD-RPN.

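This control flow can be sketched in a few lines. The sketch is illustrative, not the released implementation: `decode_direct`, `predict_roi`, and `decode_roi` are hypothetical stand-ins for the real model passes, the 0.05 gate threshold is taken from the qualitative examples on this page, and the sketch simply prefers the RoI response when the gate fires (the released fusion logic may differ).

```python
def gated_two_pass(image, question, gate_score, decode_direct,
                   predict_roi, decode_roi, threshold=0.05):
    """Gated coarse-to-fine inference: always run the cheap low-resolution
    pass; fire the RoI refinement loop only when the gate score clears
    the threshold."""
    direct = decode_direct(image, question)      # coarse global pass
    if gate_score < threshold:
        return direct                            # gate skips: answer directly
    bbox = predict_roi(image, question)          # SD-RPN localizes the RoI
    return decode_roi(image, bbox, question)     # re-decode the cropped region

# Toy usage with stub model calls (gate 0.0 < 0.05, so the direct pass wins)
answer = gated_two_pass(
    image=None, question="What does the sign say?", gate_score=0.0,
    decode_direct=lambda img, q: "Stop",
    predict_roi=lambda img, q: (0, 0, 1, 1),
    decode_roi=lambda img, box, q: "Stop (zoomed)",
)
# -> "Stop"
```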

Q-Zoom two-pass inference

Three-stage training pipeline

Each stage trains a different subset of parameters while everything else stays frozen:


Stage 1 — SD-RPN initialization

Train only the twig_T twig layers (the TWIG branch grafted on top of base layer twig_K), using pseudo RoI maps generated by the base VLM as supervision. The base LLM, lm_head, and vision encoder all stay frozen.

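Stage 1's supervision can be illustrated with a toy label-generation rule. This is a sketch under stated assumptions: the pseudo RoI maps come from the base VLM itself, but the quantile cut-offs and the ternary positive/ignore/negative scheme below are illustrative choices, not the paper's exact recipe.

```python
def pseudo_roi_labels(attn, pos_q=0.9, neg_q=0.5):
    """Turn a flat per-token attention map (list of floats) into ternary
    pseudo-labels for twig-layer training:
    1 = inside the RoI, 0 = background, -1 = ignored (ambiguous band).
    The quantile cut-offs are illustrative, not the paper's values."""
    ranked = sorted(attn)
    hi = ranked[int(pos_q * (len(ranked) - 1))]   # positive threshold
    lo = ranked[int(neg_q * (len(ranked) - 1))]   # negative threshold
    return [1 if a >= hi else 0 if a <= lo else -1 for a in attn]

# Tokens with the strongest attention become RoI positives, the weakest
# become negatives, and the ambiguous middle band is excluded from the loss.
labels = pseudo_roi_labels([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 0.9])
# -> [0, 0, 0, 0, -1, -1, 1, 1]
```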

Stage 1 — SD-RPN

Stage 2 — Targeted post-SFT

The frozen TWIG branch is carried over from Stage 1; only the LLM decoder + lm_head are trained on a hard-sample mixture mined by an LLM-as-a-Judge that compares the direct response against the RoI response on a fresh question pool.

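The mining step can be sketched as a simple filter. Here `judge` is a stand-in for the LLM-as-a-Judge call, and the real mixture construction is more involved, so treat this as a minimal illustration of the selection criterion only.

```python
def mine_hard_samples(records, judge):
    """Keep samples where zooming changes the outcome: the RoI response is
    judged correct while the direct response is not. `records` holds
    (question, direct_answer, roi_answer, reference) tuples."""
    hard = []
    for question, direct, roi, reference in records:
        if judge(roi, reference) and not judge(direct, reference):
            hard.append((question, roi))   # supervise toward the RoI answer
    return hard

# Toy usage with an exact-match stand-in for the judge
pool = [
    ("wage in Ecuador + Panama?", "124", "$18.25", "$18.25"),  # direct fails
    ("what does the sign say?",   "Stop", "Stop",  "Stop"),    # both succeed
]
hard = mine_hard_samples(pool, judge=lambda ans, ref: ans == ref)
# -> [("wage in Ecuador + Panama?", "$18.25")]
```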

Stage 2 — Targeted Post-SFT

Stage 3 — Dynamic Gating Network

Train only the high-res gating network (high_res_layers + high_res_head) that decides per input whether to fire the high-resolution RoI refinement pass. Stage 3 uses a consistency-aware sample-generation pipeline that filters unstable / contradictory examples before BCE supervision.

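A toy version of the label rule and loss. The exact filtering protocol is not spelled out on this page, so the rule below (repeat each pass, drop samples whose outcomes flip across runs, label 1 only when zooming reliably fixes the answer) is an assumption-labeled sketch.

```python
import math

def gate_label(direct_runs, roi_runs):
    """Consistency-aware label generation (illustrative). Each argument is a
    list of correctness booleans from repeated evaluations of the same
    sample. Unstable or contradictory samples return None (filtered out);
    otherwise 1 means the RoI pass is needed, 0 means the direct pass
    suffices."""
    if len(set(direct_runs)) > 1 or len(set(roi_runs)) > 1:
        return None                     # unstable across runs: drop
    direct_ok, roi_ok = direct_runs[0], roi_runs[0]
    if direct_ok:
        return 0                        # coarse pass already suffices
    if roi_ok:
        return 1                        # zooming reliably fixes the answer
    return None                         # neither pass helps: drop

def bce(p, y, eps=1e-7):
    """Binary cross-entropy on the gate's sigmoid output p for label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```

With this rule, a sample where the direct pass always fails and the RoI pass always succeeds gets label 1, and a gate score of 0.984 on such a sample incurs near-zero BCE loss.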

Stage 3 — Dynamic Gating Network

Results

Main results

Q-Zoom is competitive with — and often surpasses — recent detail-zoom and tool-use baselines across five backbones (LLaVA-1.5-7B/13B, Qwen2.5-VL-3B/7B, Qwen3-VL-4B).


Main results: Document & OCR benchmarks

Performance on Document & OCR benchmarks. Dataset subscripts denote the evaluation split; performance subscripts show the absolute improvement (↑) over the baseline. Throughput is relative to the baseline, measured on a single NVIDIA A6000 GPU. Our results are evaluated under a per-image cap of 576 visual tokens.

Main results: Vision-Centric and High-Resolution benchmarks

Performance on Vision-Centric and High-Resolution benchmarks. Dataset subscripts denote the specific evaluation split; performance subscripts indicate the absolute improvement (↑) of Q-Zoom over the baseline. Tp denotes relative inference throughput, and averages are computed exclusively across the four Overall metrics (V* Bench, MME-RW-lite, HR-Bench-4K, HR-Bench-8K). Unless otherwise noted, all Q-Zoom results are evaluated under a per-image cap of 4,096 visual tokens. The † symbol marks numbers cited directly from the corresponding original publications.

Accuracy / throughput trade-off

Sweeping the visual-token budget from 256 to 4096 traces a Pareto curve. Q-Zoom (orange) shifts the curve up and to the right of the baseline VLM (gray) on both Doc/OCR-heavy and high-resolution vision benchmarks: at matched accuracy it is multiple times faster, and at matched throughput it lifts average accuracy by several points. The trend is consistent across both supported backbones.


Qwen2.5-VL-7B

Qwen2.5-VL-7B Doc/OCR Pareto curve

Doc/OCR — 2.52× speedup at matched accuracy, +1.1% over baseline peak.

Qwen2.5-VL-7B HR Pareto curve

High-resolution — 4.39× speedup at matched accuracy, +8.1% over baseline peak.

Qwen3-VL-4B

Qwen3-VL-4B Doc/OCR Pareto curve

Doc/OCR — 1.70× speedup at matched accuracy, 53.1% visual-token reduction.

Qwen3-VL-4B HR Pareto curve

High-resolution — 2.73× speedup at matched accuracy, 72.9% visual-token reduction.

Qualitative examples

Q-Zoom qualitative examples with RoI heatmap

SD-RPN produces a per-token attention map (middle column) that isolates the region the question is actually about. The model then re-decodes only that region to recover the correct answer.
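The heatmap-to-crop step can be sketched as a threshold-and-bound operation over the token grid. The 0.5 threshold and the tight-bounding-box rule are illustrative assumptions; the released predictor may post-process the attention map differently before cropping.

```python
def heatmap_to_bbox(heatmap, thresh=0.5):
    """Convert a per-token attention map (2-D grid of floats in [0, 1]) into
    the tightest token-coordinate bounding box covering every cell at or
    above the threshold. Returns (x0, y0, x1, y1), or None if nothing is
    salient (the caller can then fall back to the full image)."""
    rows = [i for i, row in enumerate(heatmap)
            if any(v >= thresh for v in row)]
    cols = [j for row in heatmap
            for j, v in enumerate(row) if v >= thresh]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))

# A hot spot in the middle-right of a 4x4 token grid
grid = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.9, 0.0],
        [0.0, 0.0, 0.8, 0.7],
        [0.0, 0.0, 0.0, 0.0]]
box = heatmap_to_bbox(grid)   # -> (2, 1, 3, 2)
```

The resulting token-space box would then be scaled to pixel coordinates before cropping and re-decoding.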

The dynamic gate in action

Two real samples processed by the same Q-Zoom-Qwen2.5-VL-7B checkpoint at the canonical resolution. On the left (InfoVQA), the gating head fires (score 0.984) → SD-RPN finds the small "Wages per day" bar chart → re-decoding the crop reads the Ecuador and Panama bars and sums them. On the right (TextVQA), the gating head correctly skips (score 0.000) — the answer is globally obvious — and the model answers directly without any extra compute.

Note: the SD-RPN heatmap and gate scores shown here are produced under a per-image cap of 576 visual tokens.


Case A — gate fires, RoI re-decode helps

"What is the wage per day in Ecuador and Panama, taken together?"  ·  gate=0.984
Source image (Banana Wars infographic) SD-RPN attention heatmap
RoI crop · re-decoded · cropped WAGES PER DAY chart
Q-Zoom: $18.25 ✓  |  direct: 124

SD-RPN heatmap → bbox → re-decode the chart → "$18.25"

Case B — gate skips, no RoI loop fires

"What does the sign say?"  ·  gate=0.000
Source image (large STOP sign)
Gate 0.000 < 0.05 → SKIP RoI
Q-Zoom: Stop ✓ (direct decode)

Case B — gate skips again

"What does it say on the car?"  ·  gate=0.000
Source image (toy police car with POLICE livery)
Gate 0.000 < 0.05 → SKIP RoI
Q-Zoom: POLICE ✓ (direct decode)

For the two right-hand examples the gating sigmoid is essentially zero — SD-RPN never runs at all and the model decodes directly.

Released checkpoints

The final Stage-3 weights are on Hugging Face for all three supported backbones.


Backbone        HF model                         TWIG config
Qwen2.5-VL-3B   YuhengSSS/Q-Zoom-Qwen2.5VL-3B    K=24, T=3
Qwen2.5-VL-7B   YuhengSSS/Q-Zoom-Qwen2.5VL-7B    K=18, T=3
Qwen3-VL-4B     YuhengSSS/Q-Zoom-Qwen3VL-4B      K=24, T=3
huggingface-cli download YuhengSSS/Q-Zoom-Qwen2.5VL-7B \
  --local-dir ./checkpoints/Q-Zoom-Qwen2.5VL-7B \
  --local-dir-use-symlinks False

CHECKPOINT_PATH=./checkpoints/Q-Zoom-Qwen2.5VL-7B \
  bash examples/eval_only/eval_qwen2_5vl_stage3.sh

BibTeX

@article{qzoom,
  title  = {Q-Zoom: Query-Aware Adaptive Perception for Efficient
            Multimodal Large Language Models},
  author = {Shi, Yuheng and Pei, Xiaohuan and Wen, Linfeng and
            Dong, Minjing and Xu, Chang},
  journal= {arXiv preprint arXiv:2604.06912},
  year   = {2026}
}

You may also be interested in our earlier work that introduced the self-distilled RoI predictor used by Q-Zoom's SD-RPN branch:

@article{shi2025catching,
  title  = {Catching the Details: Self-Distilled RoI Predictors for
            Fine-Grained MLLM Perception},
  author = {Shi, Yuheng and Pei, Xiaohuan and Dong, Minjing and Xu, Chang},
  journal= {arXiv preprint arXiv:2509.16944},
  year   = {2025}
}