Surgical video segmentation is crucial for computer-assisted surgery, enabling precise
localization and tracking
of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as the Segment Anything Model 2 (SAM2) allow flexible target specification via visual prompts rather than predefined categories, but they struggle in surgical scenarios due to the domain gap and limited tracking capability over extended procedures. To enable comprehensive development
and evaluation of surgical iVOS, we construct SA-SV, the largest surgical iVOS benchmark to date, providing spatio-temporal mask annotations (masklets) that span six procedure types (over 61k frames and 1.6k masklets).
Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for
robust zero-shot surgical iVOS through three complementary innovations:
(1) DiveMem for stable long-term tracking via hybrid temporal sampling and diversity-based filtering;
(2) Temporal Semantic Learning (TSL) for reliable instrument re-identification via vision-language contrastive learning; and
(3) Ambiguity-Resilient Learning (ARL) for mitigating mask drift from annotation inconsistencies
across multi-source datasets.
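To make the DiveMem idea concrete, the following is a minimal sketch, not the paper's implementation: it combines dense sampling of recent frames with a greedy farthest-point selection of diverse older frames based on cosine distance between per-frame features. The function name `select_memory` and the parameters `recent_k` and `long_term_k` are illustrative assumptions.

```python
import numpy as np

def select_memory(frame_feats, recent_k=4, long_term_k=4):
    """Hybrid temporal sampling sketch: always keep the most recent
    frames, plus a diverse subset of older frames chosen greedily by
    maximizing the minimum cosine distance to the frames already kept."""
    feats = np.asarray(frame_feats, dtype=float)
    n = len(feats)
    # L2-normalize so dot products equal cosine similarity
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    recent = list(range(max(0, n - recent_k), n))
    older = list(range(0, max(0, n - recent_k)))
    selected = []
    if older:
        selected.append(older[0])  # seed with the earliest frame
        while len(selected) < min(long_term_k, len(older)):
            best, best_d = None, -1.0
            for i in older:
                if i in selected:
                    continue
                # distance to the closest already-selected frame
                d = min(1.0 - feats[i] @ feats[j] for j in selected)
                if d > best_d:
                    best, best_d = i, d
            selected.append(best)
    return sorted(selected) + recent
```

Under this sketch, the memory bank stays bounded (at most `recent_k + long_term_k` frames) while retaining appearance variety from earlier in the procedure, which is the intuition behind diversity-based filtering for long-term tracking.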
Extensive experiments demonstrate that SAM2S achieves a Macro Average J&F of 79.6, surpassing vanilla and fine-tuned SAM2 by 16.6 and 4.1 points, respectively, while maintaining real-time inference at 68 FPS and generalizing to diverse unseen surgical scenarios, including a procedure type entirely absent from training.