SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Zhitao Zeng1,†, Zhu Zhuo1,†, Xiaojun Jia2,†, Erli Zhang1,†, Junde Wu3,†
1National University of Singapore, 2Nanyang Technological University, 3University of Oxford, 4State Key Laboratory of General Artificial Intelligence, BIGAI, 5Shanghai Jiao Tong University, 6Sun Yat-sen University, 7The Chinese University of Hong Kong
Figure 1. Illustration of our Surgical Multimodal Database, SurgVLM-DB. a, SurgVLM-DB covers 16 surgical types and 18 anatomical structures, reflecting its broad diversity. b, SurgVLM-DB contains 1.81M annotated images with 7.79M conversations, and the distribution of video counts across surgical types demonstrates large-scale, comprehensive coverage, underpinning the robustness of SurgVLM.

Abstract

Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale, high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, a single universal model that can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, then standardize labels and perform hierarchical vision-language alignment to provide comprehensive coverage of progressively finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate three SurgVLM variants (SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max). Extensive experimental results show that SurgVLM consistently surpasses these commercial VLMs. SurgVLM-72B achieves a 75.4% improvement in overall arena score over Gemini 2.0 Flash, including a 96.5% improvement in phase recognition, 87.7% in action recognition, 608.1% in triplet prediction, 198.5% in instrument localization, 28.9% in critical view of safety detection, and 59.4% on a comprehensive multi-task VQA dataset. Beyond raw performance gains, SurgVLM demonstrates robust generalization and open-vocabulary QA, establishing a scalable, accurate, and clinically reliable paradigm for unified surgical intelligence.
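For concreteness, the headline gains above are relative improvements computed from the per-task scores in the leaderboard below; a minimal sketch in Python (values copied from Table 1):

```python
# Relative improvement of SurgVLM-72B over Gemini 2.0 Flash,
# using scores taken directly from the leaderboard (Table 1).
def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline * 100

# Overall arena score: 336.21 (SurgVLM-72B, MCQ) vs. 191.70 (Gemini 2.0 Flash)
print(f"{relative_improvement(336.21, 191.70):.1f}%")  # -> 75.4%

# Triplet accuracy: 13.10 (SurgVLM-72B, OV) vs. 1.85 (Gemini 2.0 Flash)
print(f"{relative_improvement(13.10, 1.85):.1f}%")     # -> 608.1%
```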

Highlights:

  1. SurgVLM-DB: A Large-scale Surgical Multimodal Database

  2. SurgVLM-Bench: A Comprehensive Surgical Benchmark for VLMs

  3. SurgVLM: A Large Vision-Language Model for Surgical Intelligence

Figure 2. Illustration of our Surgical Multimodal Database, SurgVLM-DB. a, The construction pipeline is divided into four modules. b, Overview of the task hierarchy in SurgVLM-DB, containing 10 surgical tasks ranging from visual perception to temporal analysis to reasoning. c, Comparison with existing databases: to our knowledge, SurgVLM-DB has the largest number of images, conversations, and surgical types (indicated by bubble size). d, Distribution of conversations across the 10 surgical tasks.
Figure 3. Comparison of SurgVLM and 14 mainstream commercial VLMs. a, Leaderboard on SurgVLM-Bench by overall arena score, demonstrating the superior performance of SurgVLM over other mainstream commercial VLMs. b, Comprehensive comparison with Gemini 2.0 Flash, Qwen2.5-Max, and GPT-4o. SurgVLM consistently outperforms these commercial VLMs across 24 metrics. c, Detailed comparison with 14 mainstream commercial VLMs on the most important metrics of six popular surgical datasets. SurgVLM models achieve state-of-the-art performance across all metrics.

Leaderboard of Vision-Language Models on SurgVLM-Bench

Submit Your Results: zhitao@nus.edu.sg

The leaderboard is ranked by Arena Score, obtained by summing the primary metric of each of the six surgical tasks; higher scores indicate better performance. The Evaluation column indicates whether answers were scored as multiple-choice questions (MCQ) or open-vocabulary (OV) responses.
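As a minimal sketch of how an Arena Score is assembled (values below are from the SurgVLM-72B MCQ row of Table 1):

```python
# Arena score = sum of the primary metric from each of the six tasks.
# Values are the SurgVLM-72B (MCQ) row of the leaderboard.
primary_metrics = {
    "phase_acc":   69.66,  # phase recognition accuracy
    "action_acc":  43.1,   # action recognition accuracy
    "triplet_acc": 12.52,  # triplet prediction accuracy
    "cvs_acc":     76.73,  # critical view of safety accuracy
    "vqa_acc":     75.2,   # multi-task VQA accuracy
    "loc_miou":    59.0,   # instrument localization mIoU
}
arena_score = sum(primary_metrics.values())
print(f"{arena_score:.2f}")  # -> 336.21
```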

| Rank | Model | Institute | Evaluation | Arena Score ↑ | Phase Acc | Action Acc | Triplet Acc | CVS Acc | VQA Acc | Loc mIoU |
|------|-------|-----------|------------|---------------|-----------|------------|-------------|---------|---------|----------|
| 1 | SurgVLM-72B (Ours) | iMVR Lab | MCQ | 336.21 | 69.66 | 43.1 | 12.52 | 76.73 | 75.2 | 59.0 |
| 2 | SurgVLM-72B (Ours) | iMVR Lab | OV | 331.86 | 76.40 | 42.9 | 13.10 | 76.60 | 63.46 | 59.4 |
| 3 | SurgVLM-32B (Ours) | iMVR Lab | OV | 306.91 | 71.20 | 40.1 | 12.98 | 74.51 | 59.72 | 48.4 |
| 4 | SurgVLM-7B (Ours) | iMVR Lab | OV | 290.78 | 70.30 | 45.8 | 4.15 | 76.86 | 59.67 | 34.0 |
| 5 | Gemini 2.0 Flash | Google DeepMind | MCQ | 191.70 | 38.89 | 24.4 | 1.85 | 59.61 | 47.05 | 19.9 |
| 6 | Qwen2.5-VL-72B-Instruct | Alibaba Cloud | MCQ | 184.85 | 29.30 | 28.2 | 1.27 | 41.69 | 42.19 | 42.2 |
| 7 | Qwen2.5-VL-32B-Instruct | Alibaba Cloud | MCQ | 184.40 | 37.23 | 31.8 | 0.98 | 60.53 | 42.46 | 11.4 |
| 8 | Qwen2.5-VL-7B-Instruct | Alibaba Cloud | MCQ | 175.20 | 30.45 | 31.1 | 0.35 | 65.88 | 36.82 | 10.6 |
| 9 | Qwen2.5-Max | Alibaba Cloud | MCQ | 174.37 | 34.79 | 28.3 | 0.35 | 34.77 | 36.16 | 40.0 |
| 10 | InternVL3-78B | Shanghai AI Lab | MCQ | 172.97 | 27.32 | 29.5 | 0.52 | 50.20 | 36.33 | 29.1 |
| 11 | Llama-4-Scout-17B-16E-Instruct | Meta AI | MCQ | 163.84 | 35.77 | 25.1 | 0.58 | 37.39 | 37.00 | 28.0 |
| 12 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral AI | MCQ | 156.98 | 22.61 | 12.5 | 0.46 | 68.10 | 36.41 | 16.9 |
| 13 | InternVL3-8B | Shanghai AI Lab | MCQ | 146.42 | 23.88 | 29.3 | 2.08 | 48.24 | 34.72 | 8.2 |
| 14 | MiniCPM-O-2_6 | ModelBest | MCQ | 140.34 | 17.75 | 30.8 | 0.06 | 35.95 | 35.48 | 20.3 |
| 15 | Gemma3-27B-it | Google DeepMind | MCQ | 138.93 | 14.08 | 33.2 | 0.06 | 38.04 | 35.95 | 17.6 |
| 16 | Phi-4-Multimodal-Instruct | Microsoft | MCQ | 131.10 | 22.45 | 15.1 | 0.12 | 58.43 | 34.20 | 0.8 |
| 17 | MiniCPM-V-2_6 | MiniCPM Team | MCQ | 128.77 | 15.20 | 24.3 | 0 | 38.69 | 33.28 | 17.3 |
| 18 | GPT-4o | OpenAI | MCQ | 118.71 | 36.43 | 28.1 | 1.50 | 6.67 | 38.31 | 7.7 |
| 19 | LLaVA-1.5-7B | WAIV Lab | MCQ | 112.57 | 23.46 | 5.1 | 0 | 25.49 | 31.42 | 27.1 |
| 20 | Skywork-R1V-38B | Skywork AI | MCQ | 107.64 | 6.37 | 12.3 | 0 | 43.79 | 34.58 | 10.6 |

Table 1. Quantitative comparison of VLMs on SurgVLM-Bench across six surgical tasks.
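The Loc mIoU column reports mean intersection-over-union for instrument localization. As an illustrative sketch of the metric (not the official evaluation code; paired axis-aligned boxes in `(x1, y1, x2, y2)` format are an assumption):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU: average IoU between paired predicted and ground-truth boxes."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```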

Data Samples

We provide a selection of images from SurgVLM-DB. If you're interested in exploring more, please refer to our complete dataset.