MARS2 Workshop

Multimodal Reasoning and Slow Thinking in the Large Model Era: Towards System 2 and Beyond

October 2025 | Honolulu, Hawaii | ICCV 2025


The era of large reasoning models (LRMs) has begun, bringing new opportunities and challenges to the computer vision community. The strong semantic intelligence of LLMs and the long-chain reasoning ability of LRMs have opened new frontiers in visual understanding and interpretation.

This workshop aims to bridge the gap between computer vision and large language/reasoning models, focusing on complex tasks that require advanced reasoning capabilities. We will explore how models can comprehend complex relationships through slow-thinking approaches such as neuro-symbolic reasoning, chain-of-thought prompting, and multi-step reasoning, pushing beyond traditional fixed tasks to understand object interactions within complex scenes.

By bringing together perspectives from computer vision, multimodal learning, and large language models, we aim to address the outstanding challenges of multimodal reasoning and slow thinking in the era of large reasoning models, and to foster more flexible and robust understanding in AI systems.

Keynote Speakers

Saining Xie

NYU Courant

Jiajun Wu

Stanford University

Kristen Grauman

UT Austin

Junyang Lin

Alibaba Group

Hamid Rezatofighi

Monash University

Organizers (Sorted by Last Name)

Chen Cheng

ModelScope Community

David Clifton

University of Oxford

Chen Change Loy

Nanyang Technological University

Luc Van Gool

INSAIT, Sofia University

Shengwu Xiong

Wuhan University of Technology

Peng Xu

Tsinghua University

Jiajun Zhang

Chinese Academy of Sciences

Competition Organizers (Sorted by Last Name)

Yaxiong Chen, Wuhan University of Technology
Jirui Huang, Wuhan University of Technology
Xinwei Long, Tsinghua University
Peng Xu, Tsinghua University
Ruilin Yao, Wuhan University of Technology
Bo Zhang, Wuhan University of Technology
Yifang Zhang, Wuhan University of Technology
Yichen Zhao, Wuhan University of Technology
Tianyu Zou, Wuhan University of Technology

Preliminary Program

Time | Event | Presenter
TBD | Opening remarks | Peng Xu
- | Invited talk and Q&A #1 | Saining Xie
- | Invited talk and Q&A #2 | Kristen Grauman
- | Oral presentations | ≈10 papers
- | Invited talk and Q&A #3 | Ishan Misra
- | Remarks on Challenge 1 | Peng Xu
- | Panel discussion | Saining Xie et al.
- | Lunch | -
- | Poster presentations | Poster area
- | Invited talk and Q&A #4 | Jiajun Wu
- | Invited talk and Q&A #5 | Hamid Rezatofighi
- | Remarks on Challenge 2 | Chen Change Loy
- | Invited talk and Q&A #6 | Junyang Lin
- | Closing remarks | Chen Change Loy

Paper Submission

Important Dates

  • Full Paper/Other Submission Deadline: June 10, 2025
  • Notification of Acceptance: June 18, 2025
  • Camera-Ready Papers Due: July 10, 2025

Submission Guidelines

Submissions must be in PDF format and conform to ICCV 2025 proceedings style (double-blind review). The maximum paper length is 8 pages (excluding references).

We welcome submissions of:

  • Unpublished papers (to be included in proceedings)
  • Abstracts, posters, and work-in-progress (poster presentation only)
  • Papers already accepted at other venues (poster presentation only)

Multimodal Reasoning Competition

Total Prize Pool

¥100,000 (≈ $14,000 USD)

Awards will be distributed to top performers in each track.

Competition Tracks

Track 1: VG-RS (Visual Grounding in Real-world Scenarios)

Evaluating scene perception, object localization, and spatial reasoning

Track 2: VQA-SA (Visual Question Answering with Spatial Awareness)

Evaluating spatial, commonsense, and counterfactual reasoning

Track 3: VR-Ads (Visual Reasoning in Creative Advertisement Videos)

Evaluating cognitive reasoning abilities in advertisement videos
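
The track descriptions above do not fix an evaluation metric; visual grounding benchmarks like Track 1 are commonly scored by intersection-over-union (IoU) between predicted and ground-truth boxes. The Python sketch below is a minimal illustration of that standard metric; the function name and the 0.5 acceptance threshold are illustrative assumptions, not the official MARS2 evaluation code.

    # Minimal IoU sketch (illustrative only, not the official MARS2 scorer).
    def box_iou(pred, gt):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
        ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
                 + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Grounding benchmarks often count a prediction as correct when IoU > 0.5.
    print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ≈ 0.391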

Competition Timeline

  • Dataset Release (New Benchmarks): April 15, 2025
  • Competition Start Date: May 1, 2025
  • Submission Deadline: June 15, 2025
  • Winners Announcement: During the Workshop

Dataset Information

The competition will feature a custom-built dataset with 2K+ images, 1.5K+ videos, 17K+ question-answer pairs, and 15K+ bounding-box annotations.