Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale, high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where a single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, standardizing labels and performing hierarchical vision-language alignment to provide comprehensive coverage of progressively finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, SurgVLM is built on Qwen2.5-VL and instruction-tuned on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. On SurgVLM-Bench, we evaluate three SurgVLM variants (SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max). Extensive experimental results show that SurgVLM consistently surpasses these mainstream commercial VLMs.
SurgVLM-72B achieves a 75.4% improvement in overall Arena Score over Gemini 2.0 Flash, including a 96.5% improvement in phase recognition, 87.7% in action recognition, 608.1% in triplet prediction, 198.5% in instrument localization, 28.9% in critical view of safety detection, and 59.4% on a comprehensive multi-task VQA dataset. Beyond raw performance gains, SurgVLM demonstrates robust generalization and open-vocabulary QA, establishing a scalable, accurate, and clinically reliable paradigm for unified surgical intelligence.
✨ Highlights:
SurgVLM-DB: A Large-scale Surgical Multimodal Database
SurgVLM-Bench: A Comprehensive Surgical Benchmark for VLMs
SurgVLM: A Large Vision-Language Model for Surgical Intelligence
The leaderboard is ranked by Arena Score, obtained by summing the primary metric of each of the six surgical tasks. Higher scores indicate better performance.
Rank | Model | Institute | Evaluation | Arena Score ↑ | Phase Acc | Action Acc | Triplet Acc | CVS Acc | VQA Acc | Loc mIoU |
---|---|---|---|---|---|---|---|---|---|---|
1 | SurgVLM-72B (Ours) | iMVR Lab | MCQ | 336.21 | 69.66 | 43.1 | 12.52 | 76.73 | 75.2 | 59.0 |
2 | SurgVLM-72B (Ours) | iMVR Lab | OV | 331.86 | 76.40 | 42.9 | 13.10 | 76.60 | 63.46 | 59.4 |
3 | SurgVLM-32B (Ours) | iMVR Lab | OV | 306.91 | 71.20 | 40.1 | 12.98 | 74.51 | 59.72 | 48.4 |
4 | SurgVLM-7B (Ours) | iMVR Lab | OV | 290.78 | 70.30 | 45.8 | 4.15 | 76.86 | 59.67 | 34.0 |
5 | Gemini 2.0 Flash | Google DeepMind | MCQ | 191.70 | 38.89 | 24.4 | 1.85 | 59.61 | 47.05 | 19.9 |
6 | Qwen2.5-VL-72B-Instruct | Alibaba Cloud | MCQ | 184.85 | 29.30 | 28.2 | 1.27 | 41.69 | 42.19 | 42.2 |
7 | Qwen2.5-VL-32B-Instruct | Alibaba Cloud | MCQ | 184.40 | 37.23 | 31.8 | 0.98 | 60.53 | 42.46 | 11.4 |
8 | Qwen2.5-VL-7B-Instruct | Alibaba Cloud | MCQ | 175.20 | 30.45 | 31.1 | 0.35 | 65.88 | 36.82 | 10.6 |
9 | Qwen 2.5 Max | Alibaba Cloud | MCQ | 174.37 | 34.79 | 28.3 | 0.35 | 34.77 | 36.16 | 40.0 |
10 | InternVL3-78B | Shanghai AI Lab | MCQ | 172.97 | 27.32 | 29.5 | 0.52 | 50.20 | 36.33 | 29.1 |
11 | Llama-4-Scout-17B-16E-Instruct | Meta AI | MCQ | 163.84 | 35.77 | 25.1 | 0.58 | 37.39 | 37.00 | 28.0 |
12 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral AI | MCQ | 156.98 | 22.61 | 12.5 | 0.46 | 68.10 | 36.41 | 16.9 |
13 | InternVL3-8B | Shanghai AI Lab | MCQ | 146.42 | 23.88 | 29.3 | 2.08 | 48.24 | 34.72 | 8.2 |
14 | MiniCPM-O-2_6 | ModelBest | MCQ | 140.34 | 17.75 | 30.8 | 0.06 | 35.95 | 35.48 | 20.3 |
15 | Gemma3-27B-it | Google DeepMind | MCQ | 138.93 | 14.08 | 33.2 | 0.06 | 38.04 | 35.95 | 17.6 |
16 | Phi-4-Multimodal-Instruct | Microsoft | MCQ | 131.10 | 22.45 | 15.1 | 0.12 | 58.43 | 34.20 | 0.8 |
17 | MiniCPM-V-2_6 | ModelBest | MCQ | 128.77 | 15.20 | 24.3 | 0 | 38.69 | 33.28 | 17.3 |
18 | GPT-4o | OpenAI | MCQ | 118.71 | 36.43 | 28.1 | 1.50 | 6.67 | 38.31 | 7.7 |
19 | LLaVA-1.5-7B | WAIV Lab | MCQ | 112.57 | 23.46 | 5.1 | 0 | 25.49 | 31.42 | 27.1 |
20 | Skywork-R1V-38B | Skywork AI | MCQ | 107.64 | 6.37 | 12.3 | 0 | 43.79 | 34.58 | 10.6 |
Table 1. Quantitative comparison of VLMs on SurgVLM-Bench across six surgical tasks.
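For clarity, the Arena Score in Table 1 is simply the sum of the six per-task metrics, which can be verified directly from the rows above. A minimal sketch (the helper name `arena_score` is ours, not part of any released API):

```python
def arena_score(phase, action, triplet, cvs, vqa, loc_miou):
    """Arena Score = sum of the six primary task metrics from Table 1."""
    return phase + action + triplet + cvs + vqa + loc_miou

# SurgVLM-72B (MCQ) row from Table 1
surgvlm_72b = arena_score(69.66, 43.1, 12.52, 76.73, 75.2, 59.0)
# Gemini 2.0 Flash (MCQ) row from Table 1
gemini = arena_score(38.89, 24.4, 1.85, 59.61, 47.05, 19.9)

print(f"{surgvlm_72b:.2f}")  # → 336.21
print(f"{gemini:.2f}")       # → 191.70
# Relative improvement in overall Arena Score, as quoted in the text
print(f"{100 * (surgvlm_72b - gemini) / gemini:.1f}")  # → 75.4
```

The per-task improvement percentages quoted earlier follow the same (ours − baseline) / baseline form applied to individual columns.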
We provide a selection of images from SurgVLM-DB. If you're interested in exploring more, please refer to our complete dataset.
Task: Instrument Recognition
Can you name the category for each surgical instrument?
Their categories are dissecting forceps and grasping forceps.
Task: Instrument Localization
Can you locate the Bipolar Forceps in the bottom left area?
The location is [0.179, 0.75, 0.398, 0.999].
Task: Instrument Location
Which parts of the image show instruments?
The top right and middle right areas are the parts of the image where instruments are visible.
Task: Tissue Recognition
Which tissues are being shown in the surgical picture?
The tissue being dissected is the hepatocystic triangle.
Task: Tissue Localization
Mark the Cystic Artery's position in the bottom center area.
The location is [0.357, 0.812, 0.6, 0.969].
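The localization answers above are normalized [x1, y1, x2, y2] boxes (fractions of image width and height), and the Loc mIoU column in Table 1 averages the intersection-over-union between predicted and ground-truth boxes. A minimal IoU sketch; the `pred` box here is an illustrative value of ours, paired with the Bipolar Forceps ground truth from the example above:

```python
def iou(a, b):
    """Intersection-over-union of two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = [0.20, 0.70, 0.40, 1.00]          # hypothetical model prediction
gt = [0.179, 0.75, 0.398, 0.999]         # Bipolar Forceps box from the example
print(f"{iou(pred, gt):.3f}")  # → 0.756
```

mIoU for the benchmark is then the mean of such per-box IoU values over the evaluation set.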
Task: Phase Recognition
Based on this image from Cholecystectomy surgery, identify the surgical phase.
The identified phase is gallbladder retraction. In this phase, the gallbladder enclosed in a specimen bag is extracted from the abdominal cavity through a trocar to complete the procedure.
Task: Step Recognition
What procedural step does this prostatectomy surgery image illustrate?
The demonstrated step is passing a suture through the bladder neck; during this step, the surgeon passes a suture through the bladder neck to aid urethrovesical anastomosis.
Task: Action Recognition
What action is the surgeon executing with the needle and suture at this point?
The surgeon is currently pushing the needle through the tissue, advancing it precisely to ensure proper placement.
Task: Triplet Recognition
What task is the tool accomplishing with the target in this surgical image?
The tool, a grasper, is performing a retract action on the gallbladder, and the instrument hook is performing a dissect action on the gallbladder in this surgical operation.
Task: Critical View of Safety
Confirm if each critical view criterion is satisfied: (1) Two tubular structures connected to the gallbladder are visible; (2) The hepatocystic triangle has been cleared of obstruction; (3) The lower gallbladder is dissected to expose the cystic plate.
Criterion 1: No; Criterion 2: No; Criterion 3: Yes.