PhD Candidate · University of Warwick
Jiaqi Li
Computer Vision & Multimodal Learning
I build models that perceive, localize, and act over long-horizon multimodal signals — spanning vision-language-action generation for Embodied AI and video temporal grounding for human actions.

Research Focus
Embodied AI & Vision-Language-Action
Generating physically grounded actions from language and vision for embodied agents.
Video Temporal Grounding
Localizing query-relevant moments and event boundaries in long, untrimmed videos.
Multimodal & Action Understanding
Vision-language models for temporal action localization, reasoning, and reliable evaluation.
- Action Understanding & Generation
- Embodied AI
- Vision-Language-Action
- Video Temporal Grounding
- Temporal Action Localization
- Vision-Language Models
About
I am a PhD candidate in Computer Science at the University of Warwick, supervised by Prof. Yu Guan. Previously, I completed an MSc in Computer Science at The University of Hong Kong and a BSc in Information and Computing Science, with a minor in Computer Science and Technology.
My recent work has been published at ECCV, ICML, ACL, and IMWUT, focusing on making multimodal models more efficient, faithful, and capable of long-horizon reasoning.
At a Glance
- Research Assistant, University of Warwick · 2024–25, 2026
- Research Assistant, HKUST · 2023–24
News
Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding has been accepted to ECCV 2026.
Our collaborative paper Doppler Prompting for Stable mmWave-based Human Pose Estimation has been accepted to ICML 2026.
Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization and a collaborative paper were accepted to ACL 2026.
Selected Publications
Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
We propose ActionVLM, a vision-language framework for temporal action localization that uses Language Advantage to adaptively weight language, mitigating language shortcuts and grounding localization in visual evidence.
Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
SemVID is a training-free VTG token pruning framework that preserves both boundary-critical evidence and cross-frame reasoning.
Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning
A novel fine-tuning framework that automatically synthesizes training data tailored to rejecting questions that exceed the model's knowledge, without compromising performance on other tasks.
Doppler Prompting for Stable mmWave-based Human Pose Estimation
We improve mmWave human pose stability by treating Doppler as a confidence-gated motion prompt that selectively conditions spatial magnitude, reducing spurious motion artifacts and velocity error across single- and multi-person benchmarks.
Person Parametric Physics-informed Representation for mmWave-based Human Pose Estimation
This paper proposes a new input paradigm for mmWave-based human pose estimation, which models the human as a Gaussian ensemble enriched with electromagnetic and kinematic parameters.