SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Haofeng Liu1, Ziyue Wang1, Sudhanshu Mishra1, Mingqi Gao2,
Guanyi Qin1, Chang Han Low1, Alex Y. W. Kong1, Yueming Jin1*
1National University of Singapore
2University of Sheffield
*Corresponding author

Video Demonstrations

SAM2S achieves real-time performance at 68 FPS with robust long-term tracking across diverse surgical procedures

CIS-Test Dataset: Long-term tracking in cholecystectomy

EndoVis18 Dataset: Zero-shot generalization on unseen nephrectomy procedures

RARP50 Dataset: Robust segmentation in robot-assisted radical prostatectomy

Abstract

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) Temporal Semantic Learning (TSL) for instrument understanding; and (3) Ambiguity-Resilient Learning (ARL) to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV yields substantial gains, with fine-tuned SAM2 improving by 12.99 average J&F over its vanilla counterpart. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization.

SA-SV Benchmark

The largest surgical iVOS benchmark with comprehensive segmentation annotations

572
Videos
61K
Frames
1.6K
Masklets
8
Procedure Types

Covered Surgical Procedures

  • Cholecystectomy: Endoscapes, CholecSeg8k, CholecInstanceSeg (CIS)
  • Colonoscopy: PolypGen, Kvasir-SEG, BKAI-IGH, CVC-ClinicDB
  • Gynecology: SurgAI3.8k
  • Hysterectomy: AutoLaparo, ART-Net, Hyst-YT
  • Myotomy: DSAD
  • Nephrectomy: EndoVis17, EndoVis18
  • Prostatectomy: GraSP, RARP50
  • Multi-procedural: RoboTool

Proposed Method

SAM2S Architecture

Overview of SAM2S framework integrating DiveMem, TSL, and ARL for robust surgical video segmentation

Key Innovations

DiveMem: Diverse Memory for Long-term Tracking

A trainable diverse memory mechanism that employs hybrid temporal sampling during training and diversity-based frame selection during inference. DiveMem addresses viewpoint overfitting in long-term surgical tracking by maintaining both diverse long-term memory and complete short-term memory, ensuring comprehensive temporal context that training-free approaches fail to maintain.
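The diversity-based frame selection described above can be sketched as a greedy max-min procedure over frame embeddings: at each step, keep the candidate frame farthest (in cosine distance) from everything already in long-term memory. This is a minimal illustrative sketch, not the paper's implementation; the function names and the greedy criterion are assumptions.

```python
def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-8)


def select_diverse(frame_features, k):
    """Greedy max-min selection of k diverse memory frames.

    Always keeps the first (prompted) frame, then repeatedly adds the
    frame whose nearest already-selected neighbor is most dissimilar,
    approximating a diverse long-term memory bank. Illustrative only.
    """
    if not frame_features:
        return []
    selected = [0]  # the prompted frame anchors the memory
    while len(selected) < min(k, len(frame_features)):
        best_idx, best_score = None, None
        for i in range(len(frame_features)):
            if i in selected:
                continue
            # distance to the closest frame already in memory
            score = min(
                1.0 - cosine_sim(frame_features[i], frame_features[j])
                for j in selected
            )
            if best_score is None or score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected)
```

In a full system this diverse long-term set would be concatenated with a complete short-term window of the most recent frames before memory attention, per the hybrid design described above.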

TSL: Temporal Semantic Learning

Leverages semantic categories of surgical instruments through vision-language contrastive learning with CLIP. TSL enables semantic-aware tracking by incorporating learnable CLS tokens that attend to memory features and perform cross-attention with current frame features, while preserving class-agnostic generalization capability.
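The CLIP-style contrastive alignment underlying TSL can be illustrated as scoring a visual token against per-class text embeddings with a temperature-scaled cosine similarity. This is a hedged simplification: the real TSL uses learnable CLS tokens with memory and cross-attention, and the temperature value here is an assumption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before matching."""
    n = math.sqrt(sum(x * x for x in v)) + 1e-8
    return [x / n for x in v]


def contrastive_logits(cls_token, text_embeds, temperature=0.07):
    """Score one visual CLS token against class text embeddings.

    Returns temperature-scaled cosine similarities; a softmax over these
    gives per-class probabilities, as in CLIP-style contrastive learning.
    """
    q = l2_normalize(cls_token)
    logits = []
    for t in text_embeds:
        t = l2_normalize(t)
        logits.append(sum(a * b for a, b in zip(q, t)) / temperature)
    return logits


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

Training would then apply a cross-entropy loss over these probabilities against the instrument's category label, while the segmentation head itself stays class-agnostic.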

ARL: Ambiguity-Resilient Learning

Handles annotation inconsistencies across multi-source datasets through uniform label softening using Gaussian kernel convolution. ARL transforms discrete annotation spaces into continuous probability distributions, improving model calibration and robustness at ambiguous tissue boundaries while mitigating conflicting supervision signals from varying labeling standards.
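The Gaussian label softening idea can be sketched in one dimension: convolving a hard 0/1 label row with a small Gaussian kernel leaves object interiors near 1 but gives boundary pixels soft targets in (0, 1). The kernel radius and sigma below are illustrative choices, not the paper's settings.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian kernel of width 2*radius + 1."""
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]


def soften_labels(mask_row, radius=1, sigma=1.0):
    """Convolve a hard 0/1 label row with a Gaussian kernel.

    Interior pixels keep targets near 1, boundary pixels receive soft
    targets in (0, 1), turning the discrete annotation into a continuous
    distribution that tolerates inconsistent boundary labeling.
    """
    k = gaussian_kernel(radius, sigma)
    out = []
    for i in range(len(mask_row)):
        acc = 0.0
        for j, w in enumerate(k):
            idx = i + j - radius
            if 0 <= idx < len(mask_row):  # zero-pad outside the row
                acc += w * mask_row[idx]
        out.append(acc)
    return out
```

In 2-D this would be a separable Gaussian blur over the mask; the softened map then replaces the hard target in the segmentation loss.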

Performance Highlights

Overall Accuracy

80.42
Average J&F (3-click)

+17.10 over vanilla SAM2
+4.11 over fine-tuned SAM2

Real-time Performance

68
FPS (Frames Per Second)

Real-time inference on A6000 GPU
Suitable for clinical deployment

Long-term Tracking

  • CIS-Test (≈30 min): 89.65 J&F (+9.56)
  • RARP50 (325s): 79.47 J&F (+2.96)
  • Hyst-YT (329s): 87.46 J&F (+3.57)

Cross-procedure Generalization

  • EndoVis17 (unseen): 86.72 J&F
  • EndoVis18-I (unseen): 82.37 J&F
  • Strong performance on nephrectomy without training data

Note: All test subsets remain completely unseen during training, ensuring rigorous evaluation of zero-shot generalization across diverse surgical procedures.

Citation

@article{liu2025sam2s,
  title={SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking},
  author={Liu, Haofeng and Wang, Ziyue and Mishra, Sudhanshu and Gao, Mingqi and
          Qin, Guanyi and Low, Chang Han and Kong, Alex Y. W. and Jin, Yueming},
  journal={arXiv preprint arXiv:2511.16618},
  year={2025}
}