Generating Multi-Modal Knowledge Clues as an Image: Towards Improving Image-Sequence Reasoning with Assisted Visual Input

Guanghui Ye, Huan Zhao, Yixian Shen, Jiaqi Li, Fengnan Li, Zhihua Jiang, Keqin Li

Published in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026

Recent multi-modal large language models (MLLMs) have exhibited powerful abilities on complex vision-language tasks such as image-sequence reasoning (ISR). However, significant challenges remain: it is still difficult for MLLMs to fully capture and represent cross-image visual knowledge, such as scene relations, attributes, and entity links across multiple images, which limits their performance on ISR. To alleviate this issue, we introduce a novel concept, the Visualized Knowledge Clue (VizKC): a synthetic image that encodes key visual and external knowledge from a sequence of input images and is then used alongside the original images within a multi-image MLLM to enhance reasoning performance. Accordingly, we propose an accompanying approach named VizKC-ISR, composed of two modules: VizKC generation and VizKC utilization. Specifically, in the generation module, VizKC-ISR follows a See-Find-Fuse pipeline: (i) “See - Scene Perception” constructs an initial VizKC that incorporates the scene relations of key visual entities detected in an original image; (ii) “Find - Knowledge Generation” generates enriched image captions with real-world knowledge and fine-grained entity details, and then extracts structured knowledge tuples from the generated captions; (iii) “Fuse - Image Editing” injects the relevant knowledge tuples into the VizKC via iterative image editing. In the utilization module, we employ a multi-image MLLM (e.g., mPLUG-Owl3) to solve VizKC-assisted ISR tasks by reasoning over the generated knowledge clues. We evaluate VizKC-ISR on nine ISR benchmarks categorized into three multi-image scenarios. The results show that VizKC-ISR performs best on all tasks, achieving the highest average accuracy of 63.1% and surpassing the mPLUG-Owl3 baseline by 6.4 absolute points, owing to the bridge it builds between visually grounded reasoning and multi-modal knowledge.
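To make the abstract's See-Find-Fuse pipeline and the utilization step concrete, the following is a minimal sketch of how the two modules could be wired together. It is not the authors' released implementation: every component name (detect, render, caption, extract, edit, and the mllm.generate call) is a hypothetical placeholder standing in for the corresponding stage described above.

```python
from typing import Callable, List, Sequence, Tuple

Image = object                      # placeholder type for an image object
Triple = Tuple[str, str, str]       # structured knowledge tuple (subject, relation, object)


def generate_vizkc(
    images: Sequence[Image],
    detect: Callable[[Image], Tuple[list, list]],    # "See": entity/relation detector (assumed)
    render: Callable[[list, list], Image],           # "See": scene-graph-to-image renderer (assumed)
    caption: Callable[[Image], str],                 # "Find": knowledge-enriched captioner (assumed)
    extract: Callable[[List[str]], List[Triple]],    # "Find": tuple extractor over captions (assumed)
    edit: Callable[[Image, Triple], Image],          # "Fuse": text-guided image editor (assumed)
) -> Image:
    """Build one Visualized Knowledge Clue (VizKC) image for an input image sequence."""
    # (i) See - Scene Perception: collect key entities and scene relations
    # across the sequence and render them into an initial clue image.
    entities, relations = [], []
    for img in images:
        ents, rels = detect(img)
        entities.extend(ents)
        relations.extend(rels)
    vizkc = render(entities, relations)

    # (ii) Find - Knowledge Generation: produce knowledge-enriched captions,
    # then distill them into structured knowledge tuples.
    captions = [caption(img) for img in images]
    knowledge_tuples = extract(captions)

    # (iii) Fuse - Image Editing: iteratively edit the clue image so that each
    # relevant tuple is visually encoded in it.
    for triple in knowledge_tuples:
        vizkc = edit(vizkc, triple)
    return vizkc


def answer_with_vizkc(mllm, images: Sequence[Image], question: str, **components) -> str:
    """Utilization: feed the original sequence plus the VizKC to a multi-image MLLM."""
    vizkc = generate_vizkc(images, **components)
    # `mllm.generate` is a stand-in for whichever multi-image inference API is used
    # (e.g., an mPLUG-Owl3 wrapper); the key point is that the VizKC is appended
    # to the original inputs rather than replacing them.
    return mllm.generate(list(images) + [vizkc], question)
```

The sketch keeps each pipeline stage as an injected callable, which mirrors the abstract's modular description: any detector, captioner, or editor could be swapped in without changing how the VizKC is assembled or consumed.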