Key Components
Decoupled Two-Stage Framework
A multi-modal detector for initialization and a tracker for long-term tracking share a single image encoder but use independently optimized decoders, resolving the optimization interference that arises in existing coupled designs.
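The decoupled design can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the class names, the feature math, and the update rule are all placeholder assumptions; the point is only that the two decoders hold separate parameters updated by separate steps over a shared encoder.

```python
# Hypothetical sketch of the decoupled two-stage design: one shared image
# encoder feeding two decoders whose parameters are updated independently.

class SharedEncoder:
    def encode(self, image):
        # Stand-in for a real backbone: returns a toy feature vector.
        return [pixel * 0.5 for pixel in image]

class Decoder:
    def __init__(self, name):
        self.name = name
        self.weight = 1.0  # this decoder's own parameter

    def decode(self, features):
        return [f * self.weight for f in features]

    def update(self, grad, lr=0.1):
        # Each decoder takes its own optimizer step; updating one never
        # touches the other, avoiding coupled-optimization interference.
        self.weight -= lr * grad

encoder = SharedEncoder()
detector_decoder = Decoder("detector")  # Stage I: initialization
tracker_decoder = Decoder("tracker")    # Stage II: long-term tracking

features = encoder.encode([1.0, 2.0])
detector_decoder.update(grad=0.5)       # only the detector head moves
assert tracker_decoder.weight == 1.0    # tracker head is untouched
```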
Stage I: Unified Promptable Initialization
For linguistic prompts, CSTMamba fuses text and image features for vision-language interaction; for visual prompts, spatial constraints activate tracking directly. Reliable Presence-Aware Decoding (RPAD), trained with negative sampling, suppresses hallucinated masks when the target is absent, and granularity-aware decoding supports whole-, part-, and subpart-level segmentation.
Stage II: Boundary-Aware Long-Term Tracking (BLT)
Explicit boundary supervision and boundary-aware memory integration prevent mask drift under surgical occlusions, while a diversity-driven long-term memory retains informative frames across viewpoint changes and deformations for robust extended tracking.
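One common way to realize a diversity-driven memory is a max-min admission rule: a frame is stored only if its features are sufficiently far from everything already in memory. The sketch below assumes that heuristic for illustration; the paper's actual selection criterion may differ, and all names and thresholds here are placeholders.

```python
# Illustrative diversity-driven memory: admit a frame only when its
# feature is far from all stored features, so the memory spans distinct
# viewpoints and deformations rather than near-duplicate frames.

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class DiversityMemory:
    def __init__(self, capacity, min_dist=1.0):
        self.capacity = capacity
        self.min_dist = min_dist
        self.frames = []  # stored (frame_id, feature) pairs

    def maybe_add(self, frame_id, feature):
        if len(self.frames) < self.capacity and all(
            l2(feature, f) >= self.min_dist for _, f in self.frames
        ):
            self.frames.append((frame_id, feature))
            return True
        return False

mem = DiversityMemory(capacity=3)
mem.maybe_add(0, [0.0, 0.0])
assert not mem.maybe_add(1, [0.1, 0.0])  # near-duplicate: rejected
assert mem.maybe_add(2, [2.0, 0.0])      # distinct viewpoint: kept
```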
Adaptive State Transition (AST)
Closes the loop between the two stages: credible activation temporally verifies the initialization before tracking is activated, and a consensus-based fallback periodically validates the tracked result against semantic cues to detect drift and trigger re-initialization.
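The transition logic can be pictured as a small state machine: tracking activates only after several consecutive credible detections, and a periodic semantic check can demote the system back to re-initialization. Everything below — the state names, the confirmation count, the check period — is an illustrative assumption about how such a loop could be wired, not the paper's AST module.

```python
# Minimal state-machine sketch of the adaptive transition between
# initialization and tracking. Thresholds and names are placeholders.

INIT, TRACKING = "init", "tracking"

class AdaptiveStateTransition:
    def __init__(self, k_confirm=3, check_every=5):
        self.k_confirm = k_confirm      # consecutive hits to trust the init
        self.check_every = check_every  # consensus-check period (frames)
        self.state, self.hits, self.t = INIT, 0, 0

    def step(self, detection_ok, semantic_ok=True):
        self.t += 1
        if self.state == INIT:
            # Credible activation: require K consecutive confirmations
            # before handing off to the tracker.
            self.hits = self.hits + 1 if detection_ok else 0
            if self.hits >= self.k_confirm:
                self.state = TRACKING
        elif self.t % self.check_every == 0 and not semantic_ok:
            # Consensus fallback: periodic semantic validation; drift
            # detected here triggers re-initialization.
            self.state, self.hits = INIT, 0
        return self.state

ast = AdaptiveStateTransition(k_confirm=2, check_every=2)
ast.step(True)                 # one credible frame: still initializing
state = ast.step(True)         # second in a row: tracking activates
assert state == "tracking"
```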