
Qualitative cases of ReVSeg. The frame highlighted in red indicates the selected keyframe. The green bounding box within the enlarged keyframe on the right side represents the grounding result.

Stay tuned for exciting visual results!
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision-language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations (semantics interpretation, temporal evidence selection, and spatial grounding), each aligned with capabilities the pretrained model already has. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories.
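To make the decomposition concrete, the following is a minimal sketch of a three-step chain: interpret the query, select a keyframe, then ground the target in that frame. It is not the ReVSeg implementation. The names `run_reasoning_chain` and `Trajectory`, and the three callables, are hypothetical stand-ins for VLM calls, and the bounding-box output format is assumed from the figure caption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Trajectory:
    """Record of every intermediate decision, kept for inspection."""
    target: str          # output of semantics interpretation
    keyframe_index: int  # output of temporal evidence selection
    box: Box             # output of spatial grounding


def run_reasoning_chain(
    query: str,
    num_frames: int,
    interpret: Callable[[str], str],
    select_keyframe: Callable[[str, int], int],
    ground: Callable[[str, int], Box],
) -> Trajectory:
    """Hypothetical chain; each callable stands in for one reasoning step."""
    # Step 1: turn the free-form query into a concrete target description.
    target = interpret(query)
    # Step 2: pick the frame that best shows the evidence for that target.
    k = select_keyframe(target, num_frames)
    if not 0 <= k < num_frames:
        raise ValueError(f"keyframe index {k} outside [0, {num_frames})")
    # Step 3: localize the target in the selected keyframe.
    box = ground(target, k)
    return Trajectory(target=target, keyframe_index=k, box=box)


# Usage with stub functions in place of real model calls.
traj = run_reasoning_chain(
    query="the animal that knocks over the cup",
    num_frames=8,
    interpret=lambda q: "cat",
    select_keyframe=lambda target, n: 3,
    ground=lambda target, k: (10.0, 20.0, 110.0, 220.0),
)
print(traj)
```

Because each step returns an explicit value, the whole trajectory can be logged and checked, which is the sense in which a decomposed chain is easier to interpret than a single latent prediction.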


(a) Format reward $r_f$ rapidly converges to a full score and remains saturated.
(b) Temporal reward $r_t$ increases steadily with training.
(c) Spatial reward $r_s$ increases steadily with training.
(d) Response length remains stable overall without collapse.
(e) Total reward $r$ rises consistently over time.
(f) Average number of rollout turns quickly converges to 2.
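This page does not say how $r_f$, $r_t$, and $r_s$ are computed or combined into $r$. Purely as an illustration, the sketch below assumes the spatial reward is a box intersection-over-union (a common grounding metric) and the total reward is a weighted sum. Both choices, and the names `box_iou` and `total_reward`, are assumptions rather than the authors' definitions.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def total_reward(
    r_f: float,
    r_t: float,
    r_s: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Assumed combination: weighted sum of format, temporal, spatial rewards."""
    w_f, w_t, w_s = weights
    return w_f * r_f + w_t * r_t + w_s * r_s


# Example: a predicted box scored against a reference box.
r_s = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
r = total_reward(r_f=1.0, r_t=0.5, r_s=r_s)
```

Under these assumptions, a saturated format reward adds a constant to $r$, so any further rise in the total would come from the temporal and spatial terms.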
@article{li2025revseg,
title={ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning},
author={Li, Yifan and Yin, Yingda and Zhu, Lingting and Chen, Weikai and Qian, Shengju and Wang, Xin and Fu, Yanwei},
journal={arXiv preprint arXiv:2512.02835},
year={2025}
}