UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation

Text  |  Visual  |  Audio Prompts

Haofeng Liu1, Ziyue Wang2, Alex Y. W. Kong1, Guanyi Qin1, Yunqiu Xu1,
Chang Han Low1, Mingqi Gao3, Lap Yan Lennon Chan4, Yueming Jin1,2*
1Department of Biomedical Engineering, National University of Singapore
2Department of Electrical and Computer Engineering, National University of Singapore
3School of Computer Science, The University of Sheffield
4Department of Computer Science and Engineering, The Chinese University of Hong Kong
*Corresponding author
UniSurgSAM Overview

Motivation. (A) Existing single-modal PVOS methods rely on a coupled decoder that causes optimization interference, and on fragile open-loop pipelines prone to hallucinations and mask drift. (B) UniSurgSAM achieves unified PVOS through a decoupled two-stage framework for stable optimization, presence-aware decoding to suppress hallucinations, boundary-aware tracking to prevent mask drift, and adaptive state transition for closed-loop failure recovery.

Unified Multi-Modal Interaction

Flexible surgical video segmentation through text, visual, and audio prompts in real time

Textual Prompts

"Segment the large needle driver on the left"

Natural language descriptions for precise target specification

Visual Prompts

Points, boxes, or masks on the initial frame

Direct spatial specification for high-precision control

Audio Prompts

Voice commands for hands-free operation

Ideal for sterile surgical environments

Video Demonstrations

UniSurgSAM achieves state-of-the-art performance with robust long-term tracking across diverse surgical procedures, multi-modal prompts, and granularity levels

1. Visual Promptable Segmentation


2. Textual Promptable Segmentation — Single Instrument


3. Textual Promptable Segmentation — Multi-Instrument


4. Textual Promptable Segmentation — Tissue


5. Textual Promptable Segmentation — Multi-Object


6. Audio Promptable Segmentation — Instrument


7. Audio Promptable Segmentation — Tissue


8. Part-Level Segmentation

Abstract

Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery.

To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery.

Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery.

Key Results

State-of-the-art performance across all prompt modalities

  • 55 FPS (textual prompts)
  • 68 FPS (visual prompts)
  • +5.6 J&F on Uni-RARP50
  • 90.5 J&F (textual PVOS)

Highlights: UniSurgSAM achieves 82.4 J&F on Uni-EndoVis17 (avg. 300s), 90.5 J&F on Uni-RARP50 (avg. 325s) for textual PVOS, and 84.5 J&F on Uni-RARP50 with visual prompts — demonstrating exceptional long-term tracking capabilities.

Challenges & Solutions

Building a unified PVOS system requires reliable initialization, robust tracking, and coordination between stages

Challenge 1: Optimization Interference

Initialization requires semantic interpretation of what to segment, while tracking requires spatial coherence of where it moves — existing coupled decoders compromise both.

Solution: Decoupled Two-Stage Framework — independently optimized detector and tracker.

Challenge 2: Unreliable Initialization (Stage I)

Linguistic prompts lack spatial confirmation, causing hallucinated masks when the target is temporarily absent from view.

Solution: Reliable Presence-Aware Decoding (RPAD) — explicit presence modeling with negative sampling to suppress hallucinations.

Challenge 3: Long-Term Mask Drift (Stage II)

Blood, smoke, and rapid view shifts obscure boundaries, causing gradual mask drift over extended procedures (300+ seconds).

Solution: Boundary-Aware Long-Term Tracking (BLT) — boundary supervision, geometric memory, and diversity-driven memory replacement.

Challenge 4: Open-Loop Fragility

Without coordination between stages, initialization errors and tracking drift propagate irreversibly.

Solution: Adaptive State Transition (AST) — credible activation before tracking, consensus-based fallback for drift recovery.

Method Overview

UniSurgSAM Architecture

Overview of UniSurgSAM. The model adopts a decoupled two-stage framework that independently supports visual, textual, or audio prompts: Stage I performs promptable initialization from the given prompt, while Stage II conducts boundary-aware long-term tracking. For linguistic prompts, AST acts as a central controller that routes data to the detector or tracker via a selector, coordinating bidirectional switching through credible activation (Entry) and consensus-based fallback (Exit).

Key Components

Decoupled Two-Stage Framework

A multi-modal detector for initialization and a tracker for long-term tracking share an image encoder but use independently optimized decoders, resolving the optimization interference of existing coupled designs.
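
A minimal PyTorch sketch of this decoupling (module names and the placeholder losses are illustrative assumptions, not the paper's implementation): a shared encoder produces features for two heads, and each head has its own optimizer, so detection gradients never update the tracker's decoder and vice versa.

import torch
import torch.nn as nn

# Illustrative stand-ins for the decoupled design: one shared image encoder,
# a detector head (Stage I), and a tracker head (Stage II).
class DecoupledPVOS(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.detector_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.tracker_head = nn.Conv2d(dim, 1, kernel_size=1)

model = DecoupledPVOS()

# Separate optimizers: this is what "independently optimized" means here.
det_opt = torch.optim.AdamW(model.detector_head.parameters(), lr=1e-4)
trk_opt = torch.optim.AdamW(model.tracker_head.parameters(), lr=1e-4)

frames = torch.randn(2, 3, 224, 224)
with torch.no_grad():  # shared features, treated as frozen in this sketch
    feats = model.encoder(frames)

det_loss = model.detector_head(feats).sigmoid().mean()  # placeholder objective
det_loss.backward(); det_opt.step(); det_opt.zero_grad()

trk_loss = model.tracker_head(feats).sigmoid().mean()   # placeholder objective
trk_loss.backward(); trk_opt.step(); trk_opt.zero_grad()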

Stage I: Unified Promptable Initialization

For linguistic prompts, CSTMamba fuses text and image features for vision-language interaction; for visual prompts, spatial constraints directly activate tracking. Reliable Presence-Aware Decoding (RPAD) with negative sampling suppresses hallucinations when the target is absent, and granularity-aware decoding supports whole/part/subpart segmentation.
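
The presence-modeling idea can be sketched as follows (class name, pooling scheme, and the 0.5 threshold are assumptions; RPAD's actual decoder is more involved): the decoder emits a scalar presence score alongside the mask, negative samples supervise that score toward zero, and at inference a low score yields an empty mask rather than a hallucinated one.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical presence-aware decoder: predicts a mask plus a scalar
# "is the target present?" logit from globally pooled features.
class PresenceAwareDecoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.presence_head = nn.Linear(dim, 1)

    def forward(self, feats):
        mask_logits = self.mask_head(feats)                      # (B, 1, H, W)
        pooled = feats.mean(dim=(2, 3))                          # (B, C)
        presence_logit = self.presence_head(pooled).squeeze(-1)  # (B,)
        return mask_logits, presence_logit

decoder = PresenceAwareDecoder()
feats = torch.randn(4, 256, 14, 14)
present = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 0 = negative sample, target absent

mask_logits, presence_logit = decoder(feats)
presence_loss = F.binary_cross_entropy_with_logits(presence_logit, present)

# Inference: suppress the mask instead of hallucinating one.
keep = presence_logit.sigmoid() > 0.5
masks = mask_logits.sigmoid() * keep.view(-1, 1, 1, 1)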

Stage II: Boundary-Aware Long-Term Tracking (BLT)

Explicit boundary supervision and boundary-aware memory integration prevent mask drift under surgical occlusions. Diversity-driven long-term memory retains informative frames across viewpoints and deformations for robust extended tracking.
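
One way to picture the diversity-driven replacement (the cosine criterion, the 0.9 cutoff, and the bank size are assumptions for illustration): keep a fixed-size memory of frame embeddings and, once it is full, refresh the entry most redundant with the incoming frame, so the stored frames stay spread across viewpoints and deformations.

import torch
import torch.nn.functional as F

# Hypothetical fixed-capacity memory bank with redundancy-based eviction.
class DiverseMemoryBank:
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = []  # unit-norm per-frame feature vectors

    def add(self, feat: torch.Tensor) -> None:
        feat = F.normalize(feat, dim=0)
        if len(self.entries) < self.capacity:
            self.entries.append(feat)
            return
        bank = torch.stack(self.entries)   # (N, D)
        sims = bank @ feat                 # cosine similarity to each stored frame
        nearest = int(sims.argmax())
        if sims[nearest] > 0.9:            # near-duplicate: overwrite its twin
            self.entries[nearest] = feat
        else:                              # novel view: displace the oldest entry
            self.entries.pop(0)
            self.entries.append(feat)

bank = DiverseMemoryBank(capacity=4)
for _ in range(10):
    bank.add(torch.randn(256))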

Adaptive State Transition (AST)

Closes the loop between stages: credible activation verifies initialization temporally before activating tracking, and consensus-based fallback periodically validates tracking via semantic cues to detect drift and trigger re-initialization.
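
Read as a controller, AST is a small two-state machine; the sketch below uses assumed values (a 3-frame activation window, a consensus check every 30 frames, a 0.5 IoU threshold) purely for illustration.

import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

# Hypothetical controller: "DETECT" runs Stage I, "TRACK" runs Stage II.
class AdaptiveStateTransition:
    def __init__(self, activate_after=3, check_every=30, consensus_iou=0.5):
        self.state, self.streak = "DETECT", 0
        self.activate_after = activate_after
        self.check_every = check_every
        self.consensus_iou = consensus_iou

    def step(self, t, det_confident, det_mask=None, trk_mask=None):
        if self.state == "DETECT":
            # Credible activation (Entry): hand off to the tracker only after
            # several consecutive confident detections.
            self.streak = self.streak + 1 if det_confident else 0
            if self.streak >= self.activate_after:
                self.state = "TRACK"
        elif t > 0 and t % self.check_every == 0 and det_mask is not None:
            # Consensus-based fallback (Exit): periodically compare tracker and
            # detector masks; on disagreement, drop back to re-initialization.
            if iou(det_mask, trk_mask) < self.consensus_iou:
                self.state, self.streak = "DETECT", 0
        return self.state

ast = AdaptiveStateTransition()
mask = np.zeros((16, 16), dtype=bool); mask[4:12, 4:12] = True
for t in range(100):
    state = ast.step(t, det_confident=True, det_mask=mask, trk_mask=mask)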

Multi-Modal Benchmarks

Comprehensive datasets with instance-level masklets for PVOS evaluation

  • 4 datasets refined
  • 27K+ frames annotated
  • 610+ whole-object masklets
  • 789+ part-level masklets

Dataset Details

  • Uni-EndoVis17: Instrument segmentation with long-duration test (avg. 300s)
  • Uni-EndoVis18: Instrument (-I) and tissue (-T) annotations for comprehensive evaluation
  • Uni-RARP50: 16,295 frames refined with instance-level masklets, long-duration test (avg. 325s)
  • Uni-SurgAI3.8K: Tissue segmentation benchmark for diverse anatomical structures

Key Contribution: The benchmark provides multi-modal prompts (visual, textual, audio) and multi-granular annotations (whole-object and part-level), transforming the original semantic masks into instance-level spatio-temporal masklets following da Vinci system guidelines. Instrument-related datasets are extended with fine-grained part-level masklets (e.g., "wrist of bipolar forceps"), enabling precise component tracking beyond whole-instrument segmentation.
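
For concreteness, one instance-level spatio-temporal masklet could be laid out roughly like this (field names are illustrative, not the benchmark's actual schema):

from dataclasses import dataclass, field
from typing import Dict, Optional
import numpy as np

# Hypothetical container for one masklet: a single instance tracked through
# time at a chosen granularity. Not the benchmark's real file format.
@dataclass
class Masklet:
    instance_id: int            # one physical instance per masklet
    category: str               # e.g. "bipolar forceps"
    granularity: str            # "whole", "part", or "subpart"
    part: Optional[str] = None  # e.g. "wrist" when granularity != "whole"
    masks: Dict[int, np.ndarray] = field(default_factory=dict)  # frame -> binary mask

    def add_frame(self, frame_idx: int, mask: np.ndarray) -> None:
        self.masks[frame_idx] = mask.astype(bool)

wrist = Masklet(instance_id=3, category="bipolar forceps",
                granularity="part", part="wrist")
wrist.add_frame(0, np.zeros((480, 854), dtype=np.uint8))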

Related Work

Explore our other works on surgical video segmentation with SAM2

SAM2 Series for Surgical Applications: Our research group has developed a comprehensive suite of approaches to adapt SAM2 for surgical video understanding. Surgical SAM 2 achieves real-time performance through efficient frame pruning, SAM2S provides generalizable visual PVOS with semantic long-term tracking, ReSurgSAM2 introduces text-guided referring segmentation, and UniSurgSAM (this work) unifies all modalities with enhanced reliability and consistency.

Citation

@article{liu2026unisurgsam,
  title={UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation},
  author={Liu, Haofeng and Wang, Ziyue and Kong, Alex Y. W. and Qin, Guanyi and Xu, Yunqiu and
          Low, Chang Han and Gao, Mingqi and Chan, Lap Yan Lennon and Jin, Yueming},
  year={2026}
}