Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking
of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model
2 (SAM2) provide prompt-based flexibility beyond methods restricted to predefined categories, but face challenges in surgical
scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct
SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets)
spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation of
long-term tracking and zero-shot generalization.
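For context, the prompt-and-propagate workflow that produces such masklets looks roughly like the sketch below, written against the public SAM2 video-predictor interface; the config and checkpoint paths are placeholders, and loading a surgically fine-tuned checkpoint the same way is an assumption, since the released interface is not described here.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint paths; a surgical checkpoint would be swapped in here.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="surgery_clip_frames/")  # directory of JPEG frames

    # One positive point click on frame 0 prompts object 1 (e.g., an instrument tip).
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the whole clip to obtain the masklet.
    masklet = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masklet[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```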
Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through:
(1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking;
(2) Temporal Semantic Learning (TSL) for instrument understanding; and
(3) Ambiguity-Resilient Learning (ARL) to mitigate annotation inconsistencies across multi-source datasets.
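The abstract does not detail DiveMem's design; purely to illustrate why diverse memory helps long-term tracking, the following sketch shows a generic greedy farthest-point selection over pooled frame features, keeping mutually dissimilar frames in the memory bank rather than only the most recent ones. This is a hypothetical stand-in, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_diverse_memory(frame_feats: torch.Tensor, num_slots: int = 6) -> torch.Tensor:
    """Pick `num_slots` past frames for the memory bank.

    Instead of a purely recency-based (FIFO) memory, greedily keep frames whose
    pooled features are mutually dissimilar, covering appearance changes over
    long videos. frame_feats: (T, C, H, W) features of all past frames.
    Returns sorted indices of the selected frames; the latest frame is always kept.
    """
    T = frame_feats.shape[0]
    if T <= num_slots:
        return torch.arange(T)
    pooled = F.normalize(frame_feats.flatten(2).mean(-1), dim=-1)  # (T, C)
    selected = [T - 1]                              # always keep the latest frame
    while len(selected) < num_slots:
        sim = pooled @ pooled[selected].T           # (T, k) cosine similarities
        score = sim.max(dim=1).values               # similarity to closest selected frame
        score[selected] = float("inf")              # never re-pick a selected frame
        selected.append(int(score.argmin()))        # most dissimilar candidate wins
    return torch.tensor(sorted(selected))

# Toy usage: 40 past frames with 256-channel, 16x16 memory features.
idx = select_diverse_memory(torch.randn(40, 256, 16, 16))
```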
Extensive experiments demonstrate that fine-tuning on SA-SV yields substantial performance gains, with fine-tuned SAM2 improving
by 12.99 average J&F over its vanilla counterpart. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla
and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong
zero-shot generalization.
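Here J is the region similarity (mask IoU) and F the boundary accuracy of the DAVIS protocol, averaged over frames and masklets. A simplified per-frame computation is sketched below; it uses a fixed pixel tolerance rather than the image-size-dependent one in the official evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 3) -> float:
    """Simplified boundary F: precision/recall of boundary pixels matched
    within `tol` pixels of the other mask's boundary."""
    def boundary(m):
        return np.logical_xor(m, binary_erosion(m))
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    prec = np.logical_and(pb, binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    rec = np.logical_and(gb, binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

# Toy masks: two overlapping squares offset by two pixels.
pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool); gt[12:42, 12:42] = True
print(0.5 * (region_j(pred, gt) + boundary_f(pred, gt)))  # per-frame J&F
```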