SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment

A groundbreaking advancement in **robot-assisted surgery (RAS)** analytics has emerged with the introduction of **SurgFusion-Net** and **Divergence Regulated Attention (DRA)**, a novel multimodal fusion strategy designed for highly accurate surgical skill assessment. This innovative approach, detailed in a recent arXiv publication (arXiv:2603.00108v1), not only addresses critical limitations of existing methods but also contributes two first-of-their-kind clinical datasets, paving the way for more robust and realistic surgical training and evaluation. By integrating data from multiple modalities, SurgFusion-Net significantly enhances the reliability and precision of automated surgical performance feedback, moving beyond the confines of simulated environments to real clinical scenarios.

Advancing AI in Surgical Skill Assessment

Automated surgical skill assessment holds immense transformative potential for surgical analytics and education, offering objective, data-driven feedback crucial for training the next generation of surgeons. However, developing effective multimodal methods for this purpose has been a persistent challenge. Current state-of-the-art systems largely rely on **RGB video** alone and are predominantly validated in **dry-lab settings**, creating a significant "domain gap" when applied to the complexities of real clinical cases. In live surgery, factors such as dynamic surgical environments, unpredictable camera movements, and tissue motion introduce substantial complexities that traditional models struggle to manage.

Addressing the Domain Gap with Multimodal Data

The limitations of existing systems stem from their inability to effectively fuse diverse data streams and their reliance on controlled, simulated environments. This gap between simulation and clinical reality often renders dry-lab-trained models less effective in actual operating room scenarios. The new research directly tackles this by emphasizing **multimodal information fusion**, recognizing that a comprehensive understanding of surgical performance requires more than just visual cues. By incorporating various data streams, the system can better contextualize actions and provide more nuanced assessments.

Introducing SurgFusion-Net and Divergence Regulated Attention

At the core of this innovation is **SurgFusion-Net**, a sophisticated deep learning architecture specifically designed for multimodal surgical skill assessment. Its key component is **Divergence Regulated Attention (DRA)**, an innovative fusion strategy engineered to process and integrate information from three distinct modalities. DRA is designed to adaptively focus on relevant data based on the surgical context, ensuring that the most pertinent information from each modality contributes to the overall skill assessment.
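To make the idea of context-adaptive fusion concrete, here is a minimal NumPy sketch of gated fusion over three modality feature vectors. The function name, the linear gating form, and the weight matrix `w` are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_modality_fusion(rgb, flow, seg, w):
    """Fuse three modality feature vectors with adaptive gating weights.

    rgb, flow, seg: (d,) feature vectors from each modality encoder.
    w: (3, d) hypothetical gating projection, one row per modality.
    Returns the fused (d,) vector and the per-modality weights.
    """
    feats = np.stack([rgb, flow, seg])        # (3, d)
    scores = (w * feats).sum(axis=1)          # one relevance score per modality
    alpha = softmax(scores)                   # adaptive modality weights, sum to 1
    fused = (alpha[:, None] * feats).sum(axis=0)
    return fused, alpha
```

The key property is that the weights are computed from the input features themselves, so a clip where instrument motion dominates can up-weight the flow stream while a clip with clear tool visibility can lean on segmentation.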

The Role of Divergence Regulated Attention (DRA)

**DRA** employs both **adaptive dual attention** and **diversity-promoting multi-head attention** mechanisms. This dual-attention approach allows the system to not only weigh the importance of different modalities dynamically but also to capture diverse aspects of surgical performance from each data stream. By intelligently fusing information from multiple sources, DRA significantly improves the accuracy and reliability of skill assessment, making the evaluation process more comprehensive and less susceptible to the noise and variability inherent in clinical data.
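One common way to promote diversity across attention heads is to penalize redundancy between their outputs during training. The sketch below, in NumPy, computes mean pairwise cosine similarity between head outputs as such a penalty; this specific formulation is an assumption for illustration, not necessarily the divergence regularizer used in DRA.

```python
import numpy as np

def head_diversity_penalty(head_outputs):
    """Mean pairwise cosine similarity between attention-head outputs.

    head_outputs: (h, d) array, one pooled output vector per head.
    Adding this term to the training loss discourages heads from
    collapsing onto the same features.
    """
    norms = np.linalg.norm(head_outputs, axis=1, keepdims=True)
    unit = head_outputs / np.clip(norms, 1e-8, None)   # unit-normalize each head
    sim = unit @ unit.T                                # (h, h) cosine similarities
    h = sim.shape[0]
    off_diag = sim[~np.eye(h, dtype=bool)]             # drop self-similarities
    return off_diag.mean()
```

Identical heads yield a penalty of 1, while mutually orthogonal heads yield 0, so minimizing it pushes heads toward capturing distinct aspects of the input.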

New Clinical Datasets for Real-World Training

A critical contribution of this work is the development of two first-of-their-kind **clinical datasets**, which are essential for training and validating AI models in real-world surgical contexts. These datasets offer an unprecedented level of real clinical data, directly addressing the scarcity of annotated resources for multimodal surgical AI.

The RAH-skill and RARP-skill Datasets

The first dataset, **RAH-skill**, comprises **279,691 RGB frames** extracted from **37 videos** of **Robot-assisted Hysterectomy (RAH)** procedures. The second, **RARP-skill**, includes **70,661 RGB frames** from **33 videos** of **Robot-Assisted Radical Prostatectomy (RARP)**. Both datasets are meticulously annotated with **M-GEARS skill scores**, providing a standardized measure of surgical proficiency. Crucially, they also include corresponding **optical flow** data, which captures motion dynamics, and **tool segmentation masks**, which identify and track surgical instruments. This rich, multimodal clinical data is invaluable for training robust AI models capable of understanding complex surgical actions.
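A single annotated clip in such a dataset bundles all three modalities with its skill label. The dataclass below sketches one plausible sample layout; the field names and array shapes are illustrative, not the datasets' actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkillSample:
    """One annotated clip, mirroring the modalities described for
    RAH-skill / RARP-skill (field names are hypothetical)."""
    rgb: np.ndarray        # (T, H, W, 3) video frames
    flow: np.ndarray       # (T, H, W, 2) optical flow fields
    tool_mask: np.ndarray  # (T, H, W) instrument segmentation masks
    mgears_score: float    # M-GEARS skill annotation for the clip
```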

Performance and Impact

The efficacy of **SurgFusion-Net** and **DRA** has been rigorously validated across multiple benchmarks, demonstrating superior performance compared to recent baselines. The approach showed significant improvements on the well-established **JIGSAWS benchmark**, with **SCC (Spearman's Correlation Coefficient) improvements of 0.02 under LOSO (Leave-One-Supertrial-Out)** and **0.04 under LOUO (Leave-One-User-Out)** across various tasks. More importantly, when evaluated on the newly introduced clinical datasets, the model achieved **SCC gains of 0.0538 on RAH-skill** and **0.0493 on RARP-skill**, underscoring its ability to perform effectively in real clinical settings. These results highlight a substantial leap forward in the practical application of AI for surgical skill assessment.
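The SCC metric reported above is Spearman's rank correlation between predicted and ground-truth skill scores, i.e. the Pearson correlation of their ranks. A minimal pure-Python implementation, with average ranks for ties:

```python
def rank(values):
    """Assign 1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, true):
    """Spearman's rank correlation coefficient (SCC)."""
    rp, rt = rank(pred), rank(true)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    st = sum((b - mt) ** 2 for b in rt) ** 0.5
    return cov / (sp * st)
```

An SCC of 1.0 means the model orders surgeons by skill exactly as the M-GEARS annotations do, which is why even gains of a few hundredths on this metric are meaningful.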

Why This Matters

  • Enhanced Surgical Training: Provides objective, data-driven feedback for surgeons in training, accelerating skill acquisition and refinement.
  • Improved Patient Safety: By ensuring higher skill levels through better assessment, the technology can indirectly contribute to safer surgical outcomes.
  • Bridging the Domain Gap: Overcomes limitations of dry-lab simulations, making AI models more relevant and effective in actual operating rooms.
  • Multimodal Data Integration: Establishes a new standard for fusing diverse data types (RGB, optical flow, tool segmentation) for a holistic understanding of surgical performance.
  • Clinical Data Contribution: The release of **RAH-skill** and **RARP-skill** datasets provides invaluable resources for future research and development in **AI in healthcare**.
  • Trustworthy AI in Surgery: This development reinforces the growing maturity and reliability of AI in delivering trustworthy, effective solutions for complex medical challenges.