SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Zhitao Zeng1,†, Zhu Zhuo1,†, Xiaojun Jia2,†, Erli Zhang1,†, Junde Wu3,†
1National University of Singapore, 2Nanyang Technological University, 3University of Oxford, 4State Key Laboratory of General Artificial Intelligence, BIGAI, 5Shanghai Jiao Tong University, 6Sun Yat-sen University, 7The Chinese University of Hong Kong
Figure 1. Illustration of our Surgical Multimodal Database, SurgVLM-DB. a, SurgVLM-DB covers 16 surgical types and 18 anatomical structures, reflecting its broad diversity. b, SurgVLM-DB contains 1.81M annotated images with 7.79M conversations, and the distribution of video counts across surgical types demonstrates large-scale, comprehensive coverage, underpinning the robustness of SurgVLM.

Abstract

Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale, high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, a single universal model that can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, then standardize labels and perform hierarchical vision-language alignment to provide comprehensive coverage of progressively finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate three SurgVLM variants (SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max). Extensive experimental results show that SurgVLM consistently surpasses these commercial VLMs. SurgVLM-72B achieves a 75.4% improvement in overall arena score over Gemini 2.0 Flash, including a 96.5% improvement in phase recognition, 87.7% in action recognition, 608.1% in triplet prediction, 198.5% in instrument localization, 28.9% in critical view of safety detection, and 59.4% on a comprehensive multi-task VQA dataset. Beyond raw performance gains, SurgVLM demonstrates robust generalization and open-vocabulary QA, establishing a scalable, accurate, and clinically reliable paradigm for unified surgical intelligence.
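For concreteness, the headline gains above are relative improvements computed from the per-task scores in the leaderboard below; a minimal sketch in Python (values copied from Table 1):

```python
# Relative improvement of SurgVLM-72B over Gemini 2.0 Flash,
# using scores taken directly from the leaderboard (Table 1).
def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline * 100

# Overall arena score: 336.21 (SurgVLM-72B, MCQ) vs. 191.70 (Gemini 2.0 Flash)
print(f"{relative_improvement(336.21, 191.70):.1f}%")  # -> 75.4%

# Triplet accuracy: 13.10 (SurgVLM-72B, OV) vs. 1.85 (Gemini 2.0 Flash)
print(f"{relative_improvement(13.10, 1.85):.1f}%")     # -> 608.1%
```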

Highlights:

  1. SurgVLM-DB: A Large-scale Surgical Multimodal Database

  2. SurgVLM-Bench: A Comprehensive Surgical Benchmark for VLMs

  3. SurgVLM: A Large Vision-Language Model for Surgical Intelligence

Figure 2. Illustration of our Surgical Multimodal Database, SurgVLM-DB. a, The construction pipeline is divided into four modules. b, Overview of the task hierarchy in SurgVLM-DB, containing 10 surgical tasks ranging from visual perception to temporal analysis to reasoning. c, Comparison with existing databases: to our knowledge, SurgVLM-DB has the largest number of images, conversations, and surgical types (indicated by bubble size). d, Distribution of conversations across the 10 surgical tasks.
Figure 3. Comparison of SurgVLM and 14 mainstream commercial VLMs. a, Leaderboard on SurgVLM-Bench by overall arena score, demonstrating the superior performance of SurgVLM over other mainstream commercial VLMs. b, Comprehensive comparison with Gemini 2.0 Flash, Qwen2.5-Max, and GPT-4o. SurgVLM consistently outperforms these commercial VLMs across 24 metrics. c, Detailed comparison with 14 mainstream commercial VLMs on the most important metrics of six popular surgical datasets. SurgVLM models achieve state-of-the-art performance across all metrics.

Leaderboard of Vision-Language Models on SurgVLM-Bench

Submit Your Results: zhitao@nus.edu.sg

The leaderboard is ranked by Arena Score, obtained by summing the primary metric of each of the six surgical tasks; higher scores indicate better performance. The Evaluation column indicates whether answers were scored as multiple-choice questions (MCQ) or open-vocabulary (OV) responses.
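As a minimal sketch of how an Arena Score is assembled (values below are from the SurgVLM-72B MCQ row of Table 1):

```python
# Arena score = sum of the primary metric from each of the six tasks.
# Values are the SurgVLM-72B (MCQ) row of the leaderboard.
primary_metrics = {
    "phase_acc":   69.66,  # phase recognition accuracy
    "action_acc":  43.1,   # action recognition accuracy
    "triplet_acc": 12.52,  # triplet prediction accuracy
    "cvs_acc":     76.73,  # critical view of safety accuracy
    "vqa_acc":     75.2,   # multi-task VQA accuracy
    "loc_miou":    59.0,   # instrument localization mIoU
}
arena_score = sum(primary_metrics.values())
print(f"{arena_score:.2f}")  # -> 336.21
```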

| Rank | Model | Institute | Evaluation | Arena Score ↑ | Phase Acc | Action Acc | Triplet Acc | CVS Acc | VQA Acc | Loc mIoU |
|------|-------|-----------|------------|---------------|-----------|------------|-------------|---------|---------|----------|
| 1 | SurgVLM-72B (Ours) | iMVR Lab | MCQ | 336.21 | 69.66 | 43.1 | 12.52 | 76.73 | 75.2 | 59.0 |
| 2 | SurgVLM-72B (Ours) | iMVR Lab | OV | 331.86 | 76.40 | 42.9 | 13.10 | 76.60 | 63.46 | 59.4 |
| 3 | SurgVLM-32B (Ours) | iMVR Lab | OV | 306.91 | 71.20 | 40.1 | 12.98 | 74.51 | 59.72 | 48.4 |
| 4 | SurgVLM-7B (Ours) | iMVR Lab | OV | 290.78 | 70.30 | 45.8 | 4.15 | 76.86 | 59.67 | 34.0 |
| 5 | Gemini 2.0 Flash | Google DeepMind | MCQ | 191.70 | 38.89 | 24.4 | 1.85 | 59.61 | 47.05 | 19.9 |
| 6 | Qwen2.5-VL-72B-Instruct | Alibaba Cloud | MCQ | 184.85 | 29.30 | 28.2 | 1.27 | 41.69 | 42.19 | 42.2 |
| 7 | Qwen2.5-VL-32B-Instruct | Alibaba Cloud | MCQ | 184.40 | 37.23 | 31.8 | 0.98 | 60.53 | 42.46 | 11.4 |
| 8 | Qwen2.5-VL-7B-Instruct | Alibaba Cloud | MCQ | 175.20 | 30.45 | 31.1 | 0.35 | 65.88 | 36.82 | 10.6 |
| 9 | Qwen2.5-Max | Alibaba Cloud | MCQ | 174.37 | 34.79 | 28.3 | 0.35 | 34.77 | 36.16 | 40.0 |
| 10 | InternVL3-78B | Shanghai AI Lab | MCQ | 172.97 | 27.32 | 29.5 | 0.52 | 50.20 | 36.33 | 29.1 |
| 11 | Llama-4-Scout-17B-16E-Instruct | Meta AI | MCQ | 163.84 | 35.77 | 25.1 | 0.58 | 37.39 | 37.00 | 28.0 |
| 12 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral AI | MCQ | 156.98 | 22.61 | 12.5 | 0.46 | 68.10 | 36.41 | 16.9 |
| 13 | InternVL3-8B | Shanghai AI Lab | MCQ | 146.42 | 23.88 | 29.3 | 2.08 | 48.24 | 34.72 | 8.2 |
| 14 | MiniCPM-O-2_6 | ModelBest | MCQ | 140.34 | 17.75 | 30.8 | 0.06 | 35.95 | 35.48 | 20.3 |
| 15 | Gemma3-27B-it | Google DeepMind | MCQ | 138.93 | 14.08 | 33.2 | 0.06 | 38.04 | 35.95 | 17.6 |
| 16 | Phi-4-Multimodal-Instruct | Microsoft | MCQ | 131.10 | 22.45 | 15.1 | 0.12 | 58.43 | 34.20 | 0.8 |
| 17 | MiniCPM-V-2_6 | MiniCPM Team | MCQ | 128.77 | 15.20 | 24.3 | 0 | 38.69 | 33.28 | 17.3 |
| 18 | GPT-4o | OpenAI | MCQ | 118.71 | 36.43 | 28.1 | 1.50 | 6.67 | 38.31 | 7.7 |
| 19 | LLaVA-1.5-7B | WAIV Lab | MCQ | 112.57 | 23.46 | 5.1 | 0 | 25.49 | 31.42 | 27.1 |
| 20 | Skywork-R1V-38B | Skywork AI | MCQ | 107.64 | 6.37 | 12.3 | 0 | 43.79 | 34.58 | 10.6 |

Table 1. Quantitative comparison of VLMs on SurgVLM-Bench across six surgical tasks.
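The Loc mIoU column reports mean intersection-over-union for instrument localization. As an illustrative sketch of the metric (not the official evaluation code; paired axis-aligned boxes in `(x1, y1, x2, y2)` format are an assumption):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU: average IoU between paired predicted and ground-truth boxes."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```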

Data Samples

We provide a selection of images from SurgVLM-DB. If you're interested in exploring more, please refer to our complete dataset.