Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as the Segment Anything Model 2 (SAM2) offer prompt-based flexibility beyond methods restricted to predefined categories, but struggle in surgical scenarios due to the domain gap and limited long-term tracking ability. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets), spanning eight procedure types (61k frames, 1.6k masklets) and enabling comprehensive development and evaluation of long-term tracking and zero-shot generalization.
Building on SA-SV, we propose SAM2S, a foundation model that enhances SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking (illustrated by the sketch below); (2) Temporal Semantic Learning (TSL) for instrument understanding; and (3) Ambiguity-Resilient Learning (ARL) to mitigate annotation inconsistencies across multi-source datasets.
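As a rough, purely illustrative sketch of what a "diverse memory" policy can look like (DiveMem's exact design is not specified in this abstract; the function name, farthest-point heuristic, and tensor shapes below are assumptions, not the actual implementation), the snippet keeps a fixed-size memory bank whose frame embeddings are maximally dissimilar, rather than simply retaining the most recent frames:

```python
# Illustrative only: greedy farthest-point selection over per-frame embeddings,
# one common way to make a fixed-size memory bank cover a long video.
import torch


def select_diverse_memory(frame_feats: torch.Tensor, bank_size: int) -> torch.Tensor:
    """frame_feats: (T, D) precomputed frame embeddings.
    Returns sorted indices of up to `bank_size` frames to keep in memory."""
    feats = torch.nn.functional.normalize(frame_feats, dim=-1)
    selected = [feats.shape[0] - 1]                 # always keep the latest frame
    dists = 1.0 - feats @ feats[selected[-1]]       # cosine distance to the bank
    while len(selected) < min(bank_size, feats.shape[0]):
        idx = int(torch.argmax(dists))              # frame farthest from the bank
        selected.append(idx)
        dists = torch.minimum(dists, 1.0 - feats @ feats[idx])
    return torch.tensor(sorted(selected))


# Example: keep 8 diverse frames out of a 300-frame clip
memory_indices = select_diverse_memory(torch.randn(300, 256), bank_size=8)
```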
Extensive experiments demonstrate that fine-tuning on SA-SV yields substantial performance gains, with fine-tuned SAM2 improving by 12.99 average J&F over its vanilla counterpart. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points, respectively, while maintaining real-time inference at 68 FPS and strong zero-shot generalization.