ECCV 2026 Open Visual-Pruning Suite

Keeping the Evidence Chain

Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Jiaqi Li¹, Shuntian Zheng¹, Yixian Shen², Jia-Hong Huang², Xiaoman Lu¹, Minzhe Ni¹, Yu Guan¹

¹ University of Warwick    ² University of Amsterdam

95.4% accuracy retained
5.8× prefill speedup
8.7% of original FLOPs

The central observation

Temporal grounding is not frame-level retrieval.

A model must preserve boundary cues and connect evidence across time. A few highly relevant glimpses are not enough.

01

Redundancy can erase boundaries

Visually similar frames may contain brief onset or offset cues that determine the correct temporal window.

02

Saliency creates empty gaps

Concentrating tokens in a few frames breaks long-range routing through the video sequence.

03

Relevance can fragment evidence

Repeatedly selecting local query matches produces isolated glimpses instead of a connected event trace.

Principle one

Evidence Retention

Preserve the query-critical landing distribution of the full-token model, especially around temporal boundaries.

Principle two

Connectivity Strength

Maintain adjacent-frame attention routes among retained tokens so distant evidence can still be aggregated.
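
One crude way to see whether a pruned sequence still offers adjacent-frame routes is to look for runs of frames that keep no tokens at all. The sketch below is only a proxy under that assumption; it is not the paper's Connectivity Strength measure, and the function name `longest_empty_gap` is hypothetical.

```python
def longest_empty_gap(tokens_per_frame):
    """Return the longest run of consecutive frames with zero retained tokens.

    tokens_per_frame: retained-token count per frame (hypothetical input).
    A large value means distant evidence has no intermediate frames to route through.
    """
    longest = 0
    current = 0
    for count in tokens_per_frame:
        # Extend the current empty run, or reset it when a frame keeps tokens.
        current = current + 1 if count == 0 else 0
        longest = max(longest, current)
    return longest


# A saliency-style allocation that collapses onto a few frames leaves a gap.
print(longest_empty_gap([8, 0, 0, 0, 8, 0, 4]))
# An allocation with a per-frame floor leaves none.
print(longest_empty_gap([3, 2, 1, 2, 4]))
```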

Method

Allocate by meaning.
Retain by role.

SemVID is training-free. It first distributes the token budget across frames using query evidence and inter-frame variation, then retains complementary object, motion, and context tokens.
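
The first stage can be pictured as a proportional split of the token budget. The sketch below is a minimal illustration, assuming each frame already has a query-evidence score and an inter-frame variation score. The mixing weight `alpha`, the floor `min_per_frame`, and the largest-remainder rounding are assumptions made for the example, not details taken from the paper.

```python
def allocate_budget(relevance, variation, total_budget, alpha=0.5, min_per_frame=1):
    """Split total_budget tokens across frames using a mixed per-frame score.

    relevance, variation: non-negative per-frame scores (hypothetical inputs).
    alpha: assumed weight on query evidence versus inter-frame variation.
    min_per_frame: assumed floor so that no frame is left empty.
    """
    n = len(relevance)
    if n != len(variation):
        raise ValueError("relevance and variation must have the same length")
    if total_budget < n * min_per_frame:
        raise ValueError("budget too small for the per-frame floor")

    def normalize(xs):
        total = sum(xs)
        return [x / total for x in xs] if total > 0 else [1.0 / len(xs)] * len(xs)

    rel = normalize(relevance)
    var = normalize(variation)
    score = [alpha * r + (1 - alpha) * v for r, v in zip(rel, var)]

    # Distribute what is left after the floor in proportion to the mixed score.
    spare = total_budget - n * min_per_frame
    raw = [s * spare for s in score]
    budgets = [min_per_frame + int(r) for r in raw]

    # Largest-remainder rounding so the budgets sum exactly to total_budget.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Three frames: the middle one matches the query, the last one changes the most.
print(allocate_budget([0.1, 0.8, 0.1], [0.2, 0.2, 0.6], total_budget=16))
```

The floor is what keeps low-scoring frames from dropping out entirely, which is the property the context tokens rely on later.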

Overview of SemVID frame-level budget allocation and role-aware token selection.
SemVID builds a compact but continuous representation: query-aligned objects anchor the event, motion tokens relay changes, and context tokens prevent temporal gaps.

Object tokens

Keep discriminative evidence

Query-to-patch similarity locates relevant objects; maximal marginal relevance avoids spending the budget on near-duplicates.
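
Maximal marginal relevance (MMR) is a standard greedy rule: each step picks the candidate that is most relevant to the query and least similar to what has already been chosen. The sketch below shows the generic rule on toy vectors. The cosine similarity, the trade-off weight `lam`, and the function names are assumptions for illustration; the page does not say how SemVID parameterises it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mmr_select(query, patches, k, lam=0.7):
    """Greedy MMR: return the indices of up to k patches.

    lam close to 1 favours query relevance; lam close to 0 favours diversity.
    """
    relevance = [cosine(query, p) for p in patches]
    selected = []
    candidates = list(range(len(patches)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to any already-selected patch.
            redundancy = max(
                (cosine(patches[i], patches[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
patches = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # patch 1 nearly duplicates patch 0
print(mmr_select(query, patches, k=2, lam=0.7))  # relevance-heavy
print(mmr_select(query, patches, k=2, lam=0.3))  # diversity-heavy
```

With the relevance-heavy setting the near-duplicate is still picked second; lowering `lam` spends the second slot on the distinct patch instead.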

Motion tokens

Carry temporal change

Local feature differences reveal changing regions, while query-aware filtering keeps the changes useful to the event.
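
A minimal sketch of this idea, assuming patch features are aligned by position across consecutive frames: score each patch by how much its feature changed, drop patches that do not match the query, and keep the largest changes. The L2 distance, the cosine threshold `min_query_sim`, and the function names are assumptions for the example rather than the paper's exact formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_motion_tokens(prev_frame, cur_frame, query, k, min_query_sim=0.0):
    """Return indices of up to k patches in cur_frame with the largest change.

    prev_frame, cur_frame: lists of patch feature vectors, aligned by position.
    min_query_sim: assumed threshold; patches below it are treated as irrelevant motion.
    """
    scored = []
    for idx, (prev_patch, cur_patch) in enumerate(zip(prev_frame, cur_frame)):
        # Local feature difference between the same position in adjacent frames.
        change = math.sqrt(sum((c - p) ** 2 for c, p in zip(cur_patch, prev_patch)))
        # Query-aware filter: keep only changes that align with the query.
        if cosine(query, cur_patch) >= min_query_sim:
            scored.append((change, idx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:k]]


prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]
query = [1.0, 0.0]
print(select_motion_tokens(prev, cur, query, k=1))                       # filtered
print(select_motion_tokens(prev, cur, query, k=1, min_query_sim=-1.0))   # unfiltered
```

In the toy input, patch 2 changes the most but points away from the query, so the filtered call keeps patch 1 instead.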

Context tokens

Bridge the sequence

Per-frame prototypes and saliency anchors maintain a minimum evidence path through otherwise low-budget frames.
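
One plausible reading of this step, sketched under stated assumptions: for every frame, keep the patch closest to the frame's mean feature as a prototype, plus the single most salient patch as an anchor. The mean-based prototype, the externally supplied `saliency` scores, and the function name are all hypothetical; the page does not specify how SemVID computes either.

```python
def context_tokens(frame_patches, saliency):
    """Return sorted indices of the context tokens kept for one frame.

    frame_patches: non-empty list of patch feature vectors for the frame.
    saliency: one score per patch (hypothetical input, e.g. an attention score).
    """
    n = len(frame_patches)
    dim = len(frame_patches[0])
    mean = [sum(p[d] for p in frame_patches) / n for d in range(dim)]

    def sq_dist_to_mean(p):
        return sum((x - m) ** 2 for x, m in zip(p, mean))

    # Prototype: the real patch that best summarises the frame.
    prototype = min(range(n), key=lambda i: sq_dist_to_mean(frame_patches[i]))
    # Anchor: the patch with the highest saliency score.
    anchor = max(range(n), key=lambda i: saliency[i])
    # A set removes the duplicate when both roles land on the same patch.
    return sorted({prototype, anchor})


patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 0.0]]
print(context_tokens(patches, saliency=[0.1, 0.2, 0.3, 0.9]))
```

Because every frame yields at least one index, even a frame with almost no budget still contributes a token to the sequence.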

Main results

Aggressive pruning, without losing the event.

Qwen3-VL-4B on ActivityNet Captions with only 12.5% of visual tokens.

SemVID mIoU

38.49 (95.4% of the full-token baseline)

Prefill latency

217.7 ms, from 1263.4 ms

Video FLOPs

4.8 TFLOPs, from 59.4 TFLOPs

Charades-STA

49.89 (89.0% retained)

Token budget

12.5% (training-free pruning)

Accuracy–efficiency trade-off

Comparable speed. Much stronger grounding.

FastVID is marginally faster, but SemVID recovers +5.33 mIoU by preserving the evidence chain.

Method | mIoU ↑ | Prefill ↓ | Speedup ↑
Full tokens | 40.33 | 1263.4 ms | 1.0×
VisionZip | 19.89 | 895.2 ms | 1.4×
FastVID | 33.16 | 209.7 ms | 6.0×
SemVID | 38.49 | 217.7 ms | 5.8×
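
The retention and speedup figures follow directly from the table. The short script below recomputes them from the reported mIoU and prefill latency values. The helper names are illustrative; the numbers are the ones listed above.

```python
# (mIoU, prefill latency in ms) as reported in the table above.
results = {
    "Full tokens": (40.33, 1263.4),
    "VisionZip": (19.89, 895.2),
    "FastVID": (33.16, 209.7),
    "SemVID": (38.49, 217.7),
}

full_miou, full_ms = results["Full tokens"]


def retention_pct(miou):
    """Share of the full-token mIoU that a method keeps, in percent."""
    return 100.0 * miou / full_miou


def speedup(ms):
    """Prefill speedup relative to the full-token model."""
    return full_ms / ms


for name, (miou, ms) in results.items():
    print(f"{name:12s} retention {retention_pct(miou):5.1f}%  speedup {speedup(ms):.1f}x")

# Gap between SemVID and FastVID at a comparable speed.
gap = results["SemVID"][0] - results["FastVID"][0]
print(f"SemVID - FastVID: {gap:+.2f} mIoU")
```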
Accuracy retention curves across three visual token budgets and three video grounding benchmarks.
Across three benchmarks and increasingly aggressive budgets, SemVID degrades more gracefully than competing methods.

What the ablations say

The allocation policy is portable.

Replacing FastVID’s uniform allocation with SemVID’s semantic budget raises Charades-STA mIoU from 35.98 to 48.88.

+12.90 mIoU

Inside the evidence chain

See what survives.

The analysis makes the mechanism visible: where tokens are allocated, whether intermediate frames stay connected, and where the model’s attention lands after pruning.

Token budget allocation across frames for ToMe, VScan, FastVID, and SemVID.
Semantic allocation. SemVID concentrates capacity around the ground-truth event while preserving relay evidence in intermediate frames. Other methods are uniform, collapse onto a few frames, or leave gaps.
Attention landing maps for VisionZip, FastVID, and SemVID at different pruning ratios.
Attention landing. Even at low budgets, SemVID retains attention on decisive evidence such as the hands and red bag, whereas competing methods progressively lose it.

Beyond temporal grounding

The same evidence principles transfer to VideoQA.

99.7% LongVideoBench accuracy retained with 25% visual tokens
99.4% VideoMME accuracy retained with 25% visual tokens

Citation

Found this useful?

Please cite our paper and star the repository. It helps others discover the project.

@article{li2026keeping,
  title   = {Keeping the Evidence Chain: Semantic Evidence
             Allocation for Training-Free Token Pruning in
             Video Temporal Grounding},
  author  = {Li, Jiaqi and Zheng, Shuntian and Shen, Yixian and
             Huang, Jia-Hong and Lu, Xiaoman and Ni, Minzhe and
             Guan, Yu},
  journal = {arXiv preprint arXiv:2603.05663},
  year    = {2026}
}