October 2025 | Honolulu, Hawaii | ICCV 2025
The era of large reasoning models (LRMs) has begun, bringing new opportunities and challenges to the computer vision community. The strong semantic intelligence of LLMs and the long-chain reasoning ability of LRMs have opened new frontiers in visual understanding and interpretation.
This workshop aims to bridge the gap between computer vision and large language/reasoning models, focusing on complex tasks that require advanced reasoning capabilities. We will explore how models can comprehend complex relationships through slow-thinking approaches such as neuro-symbolic reasoning, chain-of-thought prompting, and multi-step reasoning, pushing beyond traditional fixed tasks toward understanding object interactions within complex scenes.
The goal is to bring together perspectives from computer vision, multimodal learning, and large language models to address outstanding challenges in multimodal reasoning and slow thinking, fostering more flexible and robust understanding in AI systems.
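As a concrete illustration of the slow-thinking setups discussed above, the minimal sketch below shows chain-of-thought prompting for visual question answering. The `query_vlm` callable is a hypothetical stand-in for whatever vision-language model API you use, and the prompt template is only one plausible way to elicit multi-step reasoning; neither is prescribed by the workshop.

```python
# Minimal sketch of chain-of-thought (CoT) prompting for visual reasoning.
# `query_vlm` is a hypothetical stand-in for any vision-language model API;
# swap in your model of choice (an open-source VLM or a hosted service).

COT_TEMPLATE = (
    "Question: {question}\n"
    "Think step by step before answering:\n"
    "1. List the relevant objects in the image.\n"
    "2. Describe their spatial relationships.\n"
    "3. Combine these observations to answer the question.\n"
    "Answer:"
)

def ask_with_cot(query_vlm, image_path: str, question: str) -> str:
    """Send one image plus a multi-step reasoning prompt to a VLM."""
    prompt = COT_TEMPLATE.format(question=question)
    return query_vlm(image=image_path, prompt=prompt)
```

The intermediate steps make the model's reasoning inspectable, which is precisely the kind of slow-thinking behavior the workshop focuses on.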
NYU Courant
Stanford University
UT Austin
Meta
Alibaba Group
Monash University
ModelScope Community
Nanyang Technological University
University of Oxford
INSAIT Sofia University
Tsinghua University
Wuhan University of Technology
Chinese Academy of Sciences
Time | Event | Presenter |
---|---|---|
TBD | Opening remarks | Peng Xu |
- | Invited talk and Q&A #1 | Saining Xie |
- | Invited talk and Q&A #2 | Kristen Grauman |
- | Oral presentations | ≈10 papers |
- | Invited talk and Q&A #3 | Ishan Misra |
- | Remarks on Challenge 1 | Peng Xu |
- | Panel discussion | Saining Xie and others |
- | Lunch | - |
- | Poster presentation | Poster area |
- | Invited talk and Q&A #4 | Jiajun Wu |
- | Invited talk and Q&A #5 | Hamid Rezatofighi |
- | Remarks on Challenge 2 | Chen Change Loy |
- | Invited talk and Q&A #6 | Junyang Lin |
- | Closing remarks | Chen Change Loy |
Submissions must be in PDF format and conform to the ICCV 2025 proceedings style; review is double-blind. The maximum paper length is 8 pages, excluding references.
We welcome submissions of:
Awards will be distributed to top performers in each track.
- **Visual Grounding in Real-world Scenarios**: evaluating scene perception, object localization, and spatial reasoning
- **Visual Question Answering with Spatial Awareness**: evaluating spatial, commonsense, and counterfactual reasoning
- **Visual Reasoning in Creative Advertisement Videos**: evaluating cognitive reasoning abilities in advertisement videos
The competition will feature a custom-built dataset with 2K+ images, 1.5K+ videos, 17K+ question-answer pairs, and 15K+ bounding-box annotations.
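For orientation, grounding tracks of this kind are commonly scored with intersection-over-union (IoU) between predicted and ground-truth boxes, with Acc@0.5 as the conventional headline metric. The sketch below assumes `[x1, y1, x2, y2]` pixel coordinates; this is an illustrative assumption, not the challenge's published specification, so consult the official evaluation kit for the actual format and metrics.

```python
# Sketch of IoU-based scoring for a visual grounding track.
# Assumes boxes are [x1, y1, x2, y2] in pixels; the actual challenge
# format may differ, so check the official evaluation kit.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```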