Generating Multi-Modal Knowledge Clues as an Image: Towards Improving Image-Sequence Reasoning with Assisted Visual Input
Published in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026
Recent multi-modal large language models (MLLMs) have exhibited powerful abilities in addressing complex vision-language tasks such as image-sequence reasoning (ISR). However, significant challenges remain: it is still difficult for MLLMs to fully capture and represent cross-image visual knowledge, such as scene relations, attributes, and entity links between multiple images, which hinders their performance on ISR. To alleviate these issues, we introduce a novel concept, the Visualized Knowledge Clue (VizKC): a synthetic image that encodes key visual and external knowledge from a sequence of input images and is then used alongside the original input images within a multi-image MLLM to enhance reasoning performance. Accordingly, we propose an accompanying approach named VizKC-ISR, composed of two modules: VizKC generation and VizKC utilization. In the generation module, VizKC-ISR follows a See-Find-Fuse pipeline: (i) "See - Scene Perception" constructs an initial VizKC that incorporates the scene relations of key visual entities detected in an original image; (ii) "Find - Knowledge Generation" generates enriched image captions with real-world knowledge and fine-grained entity details, and then extracts structured knowledge tuples from the generated captions; (iii) "Fuse - Image Editing" introduces relevant knowledge tuples into the VizKC via iterative image editing. In the utilization module, we employ a multi-image MLLM (e.g., mPLUG-Owl3) to solve VizKC-assisted ISR tasks by reasoning over the generated knowledge clues. We evaluate VizKC-ISR on nine ISR benchmarks categorized into three multi-image scenarios. The results show that VizKC-ISR performs best on all tasks, obtaining the highest average accuracy of 63.1% and surpassing the mPLUG-Owl3 baseline by 6.4 absolute points, owing to the bridge VizKC builds between visually grounded reasoning and multi-modal knowledge.
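The See-Find-Fuse control flow described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the detector, captioner, and image-editing model are replaced by simple stand-ins over dictionaries and tuples, and all function names, the tuple format, and the `max_edits` cap are assumptions made for clarity.

```python
# Hypothetical sketch of the See-Find-Fuse pipeline for building a VizKC.
# Real perception, captioning, and image-editing models are replaced by
# simple data-structure operations; all names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class VizKC:
    """A visualized knowledge clue: detected entities, their scene
    relations, and knowledge tuples fused in by (simulated) editing."""
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)   # (subject, relation, object)
    knowledge: list = field(default_factory=list)   # (entity, attribute, value)


def see_scene_perception(entities, relations):
    """'See': build an initial VizKC from entities and scene relations
    detected in one original image (detector output is assumed given)."""
    return VizKC(entities=list(entities), relations=list(relations))


def find_knowledge_generation(clue, candidate_tuples):
    """'Find': stand-in for caption enrichment + tuple extraction --
    keep only knowledge tuples about entities already in the clue."""
    return [t for t in candidate_tuples if t[0] in clue.entities]


def fuse_image_editing(clue, tuples, max_edits=3):
    """'Fuse': iteratively 'edit' the clue, adding one knowledge tuple
    per editing round (a real system would redraw the synthetic image)."""
    for t in tuples[:max_edits]:
        clue.knowledge.append(t)
    return clue


# Processing one image of a sequence: detected content plus external facts.
clue = see_scene_perception(["dog", "frisbee"],
                            [("dog", "chases", "frisbee")])
tuples = find_knowledge_generation(
    clue,
    [("dog", "species", "border collie"), ("tree", "type", "oak")])
clue = fuse_image_editing(clue, tuples)
print(clue.knowledge)  # the 'tree' fact is filtered out by the Find step
```

The resulting `VizKC` would then be rendered as a synthetic image and passed, together with the original sequence, to a multi-image MLLM in the utilization module.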
