SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

1National University of Singapore
2University of Sheffield
*Corresponding author

Video Demonstrations

SAM2S achieves real-time performance at 68 FPS with robust long-term tracking across diverse surgical procedures

CIS-Test Dataset: Long-term tracking in cholecystectomy

EndoVis18 Dataset: Zero-shot generalization on unseen nephrectomy procedures

RARP50 Dataset: Robust segmentation in robot-assisted radical prostatectomy

Endoscapes2023 Dataset: Robust tissue tracking under viewpoint changes and tissue deformation on unseen video

NUH In-House Dataset: Generalization to dynamic scenes in external hysterectomy video

Abstract

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) offer flexible target specification through visual prompts beyond predefined categories, but face challenges in surgical scenarios due to the domain gap and limited tracking capability over extended procedures. To enable comprehensive development and evaluation of surgical iVOS, we construct SA-SV, the largest surgical iVOS benchmark with spatio-temporal mask annotations (masklets) spanning six procedure types (over 61k frames, 1.6k masklets). Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for robust zero-shot surgical iVOS through three complementary innovations: (1) DiveMem for stable long-term tracking via hybrid temporal sampling and diversity-based filtering; (2) Temporal Semantic Learning (TSL) for reliable instrument re-identification via vision-language contrastive learning; and (3) Ambiguity-Resilient Learning (ARL) for mitigating mask drift from annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that SAM2S achieves 79.6 Macro Average J&F, surpassing vanilla and fine-tuned SAM2 by 16.6 and 4.1 points respectively, while maintaining 68 FPS real-time inference and generalizing to diverse unseen surgical scenarios, even to a procedure type entirely absent from training.

Introduction

Overview of SA-SV benchmark and SAM2S framework. (a) Dataset scale comparison. (b) Category distribution and statistics of the SA-SV benchmark. (c) SAM2 for natural videos. (d) SAM2S for surgical videos with enhanced long-term tracking and domain-specific modules.

SA-SV Benchmark

The largest surgical iVOS benchmark with comprehensive masklet annotations across six procedure types

572
Videos
61K
Frames
1.6K
Masklets
6
Procedure Types

Covered Surgical Procedures

  • Cholecystectomy: Endoscapes, CholecSeg8k, CholecInstanceSeg (CIS)
  • Colonoscopy: PolypGen, Kvasir-SEG, BKAI-IGH, CVC-ClinicDB
  • Gynecology: SurgAI3.8k, AutoLaparo, ART-Net, Hyst-YT
  • Myotomy: DSAD
  • Nephrectomy: EndoVis17, EndoVis18
  • Prostatectomy: GraSP, RARP50

Proposed Method

SAM2S Architecture

Architecture overview of SAM2S. DiveMem enhances long-term tracking through diverse memory management, TSL improves instrument semantic discrimination via vision-language contrastive learning, and ARL mitigates annotation inconsistencies across multi-source datasets.

Key Innovations

DiveMem: Diverse Memory for Long-term Tracking

A trainable diverse memory mechanism that employs hybrid temporal sampling during training and diversity-based frame selection during inference. DiveMem addresses viewpoint overfitting in long-term surgical tracking by maintaining both diverse long-term memory and complete short-term memory, ensuring comprehensive temporal context that training-free approaches fail to maintain.
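The inference-time side of this idea can be illustrated with a minimal sketch. This is not the released implementation: `select_diverse_memory` is a hypothetical helper, per-frame feature vectors and a greedy farthest-point criterion are assumptions, and the trainable components of DiveMem are omitted.

```python
import numpy as np

def select_diverse_memory(features, n_long=4, n_short=2):
    """Hypothetical sketch of diversity-based memory selection.

    features: (T, D) array of per-frame embeddings, ordered by time.
    Keeps the n_short most recent frames as complete short-term memory,
    then greedily picks up to n_long mutually dissimilar older frames
    (farthest-point selection under cosine similarity) as diverse
    long-term memory. Returns the frame indices of the memory bank.
    """
    T = features.shape[0]
    short = list(range(max(0, T - n_short), T))   # complete short-term memory
    older = np.arange(0, max(0, T - n_short))
    if older.size == 0:
        return short
    # Normalize so dot products are cosine similarities.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = [int(older[0])]                    # seed with the earliest frame
    while len(selected) < min(n_long, older.size):
        sims = f[older] @ f[selected].T           # (|older|, |selected|)
        # Pick the older frame least similar to anything already kept.
        cand = int(older[np.argmin(sims.max(axis=1))])
        if cand in selected:
            break
        selected.append(cand)
    return sorted(set(selected)) + short
```

The intent is that a long static viewpoint fills the bank with near-duplicate frames under recency-only selection, whereas the diversity criterion retains appearance variation accumulated over the procedure.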

TSL: Temporal Semantic Learning

Leverages semantic categories of surgical instruments through vision-language contrastive learning with CLIP. TSL enables reliable instrument re-identification by incorporating a learnable CLS token that aggregates semantic context from memory features via cross-attention, then attends to the current frame to produce a temporal semantic representation. A confidence-aware gating mechanism ensures that semantic supervision is applied only when tracking confidence is high, preventing unreliable alignment from degrading performance.
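The data flow described above can be sketched as follows. This is a simplified stand-in, not the paper's module: the function names, the single-head unprojected attention, and the fixed confidence threshold `tau` are all illustrative assumptions, and the CLIP text-alignment loss is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = query.shape[-1]
    attn = softmax(query @ keys.T / np.sqrt(d))   # (1, N) attention weights
    return attn @ values                          # (1, D) attended output

def temporal_semantic_token(cls_token, memory_feats, frame_feats,
                            confidence, tau=0.8):
    """Hypothetical sketch of TSL's gated semantic representation.

    cls_token:    (1, D) learnable CLS token
    memory_feats: (M, D) memory-bank features
    frame_feats:  (N, D) current-frame features
    confidence:   scalar tracking confidence in [0, 1]
    Returns the temporal semantic vector, or None when the gate rejects
    a low-confidence frame so unreliable alignment is not supervised.
    """
    ctx = cross_attend(cls_token, memory_feats, memory_feats)  # aggregate memory
    sem = cross_attend(ctx, frame_feats, frame_feats)          # attend to frame
    return sem if confidence >= tau else None
```

In the full model this representation would be contrastively aligned with CLIP text embeddings of instrument category names; the gate ensures that alignment is only trained on frames the tracker is confident about.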

ARL: Ambiguity-Resilient Learning

Handles annotation inconsistencies across multi-source datasets by adaptively modulating pixel-wise loss contribution based on prediction uncertainty within boundary regions. ARL extracts boundary regions via morphological operations and down-weights high-uncertainty boundary pixels using prediction entropy, reducing the impact of inconsistent annotations while maintaining full supervision for confident regions, thereby mitigating mask jitter and drift during long-term tracking.
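A minimal sketch of this weighting scheme is below. It assumes a particular form not specified here: a one-pixel boundary ring from dilation minus erosion (with a cross-shaped structuring element, and `np.roll` wrap-around at image borders accepted for brevity) and binary prediction entropy as the uncertainty measure; the real ARL may differ in both choices.

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 cross; np.roll wraps at borders (sketch only)."""
    m = mask.astype(bool)
    out = m.copy()
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        out |= np.roll(m, (dy, dx), axis=(0, 1))
    return out

def arl_weights(pred_prob, gt_mask, ring=1):
    """Hypothetical sketch of ambiguity-resilient loss weighting.

    pred_prob: (H, W) predicted foreground probability
    gt_mask:   (H, W) binary ground-truth mask
    Boundary = dilated GT minus eroded GT. Inside it, each pixel's loss
    weight is 1 minus the normalized binary entropy of the prediction,
    so uncertain boundary pixels are down-weighted while confident
    regions keep full weight 1.
    """
    m = gt_mask.astype(bool)
    dil, ero = m, m
    for _ in range(ring):
        dil = dilate(dil)
        ero = ~dilate(~ero)                     # erosion as the dual of dilation
    boundary = dil & ~ero
    p = np.clip(pred_prob, 1e-6, 1 - 1e-6)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # in [0, 1]
    w = np.ones_like(pred_prob)
    w[boundary] = 1.0 - entropy[boundary]
    return w
```

Multiplying the per-pixel segmentation loss by these weights leaves confident interior supervision untouched while softening the penalty exactly where multi-source annotations tend to disagree.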

Performance Highlights

Zero-shot Generalization

79.6
Macro Average J&F (3-click)

+16.6 over vanilla SAM2
+4.1 over fine-tuned SAM2

Real-time Performance

68
FPS (Frames Per Second)

Real-time inference on A6000 GPU
Suitable for clinical deployment

Long-term Tracking

Improvement over vanilla SAM2

  • CIS-Test (≈30 min): 90.3 J&F (+47.8)
  • RARP50 (325s): 71.6 J&F (+26.0)
  • Hyst-YT (329s): 86.2 J&F (+12.3)

Cross-procedure Generalization

  • EndoVis17 (unseen): 86.2 J&F
  • EndoVis18-I (unseen): 82.9 J&F
  • Strong generalization to nephrectomy, a procedure type entirely absent from training

Note: All test subsets remain completely unseen during training, ensuring rigorous evaluation of zero-shot generalization across diverse surgical procedures.

Related Work

Explore our other works on surgical video segmentation with SAM2

SAM2 Series for Surgical Applications: Our research group has developed a comprehensive suite of approaches to adapt Segment Anything Model 2 (SAM2) for surgical video understanding. Surgical SAM 2 achieves real-time performance (86 FPS) through efficient frame pruning; ReSurgSAM2 introduces text-guided referring segmentation with credible tracking initialization for language-driven surgical scene understanding; and SAM2S (this work) provides the most generalizable solution, combining semantic long-term tracking with diverse memory mechanisms across diverse surgical procedures.

Acknowledgement

We would like to thank National University Hospital (NUH) for providing the external hysterectomy videos. These resources were valuable for verifying the effectiveness and robustness of our model in real-world surgical scenarios.

Citation

@article{liu2025sam2s,
  title={SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking},
  author={Liu, Haofeng and Wang, Ziyue and Mishra, Sudhanshu and Gao, Mingqi and Qin, Guanyi and Low, Chang Han and Kong, Alex Y. W. and Jin, Yueming},
  journal={arXiv preprint arXiv:2511.16618},
  year={2025}
}