ArXiv Preprint
Manual labeling of gestures in robot-assisted surgery is labor-intensive,
error-prone, and requires expertise or training. We propose a method for
automated and explainable generation of gesture transcripts that leverages the
abundance of image segmentation data to train a surgical scene segmentation
model that provides surgical tool and object masks. Surgical context is then
detected from the segmentation masks by examining the distances and
intersections between the tools and objects. Next, context labels are
translated into gesture transcripts using knowledge-based Finite State Machine
(FSM) and data-driven Long Short-Term Memory (LSTM) models. We evaluate the
performance of each stage
of our method by comparing the results with the ground truth segmentation
masks, the consensus context labels, and the gesture labels in the JIGSAWS
dataset. Our results show that our segmentation models achieve state-of-the-art
performance in recognizing the needle and thread in Suturing, and that we can
automatically detect important surgical states with high agreement with
crowd-sourced labels (e.g., contact between the graspers and objects in Suturing).
We also find that the FSM models are more robust to poor segmentation and
labeling performance than the LSTM models. Our proposed method can
significantly shorten the gesture labeling process (by a factor of ~2.8).
Kay Hutchinson, Zongyu Li, Ian Reyes, Homa Alemzadeh
2023-02-28
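
As an illustration of the context-detection step described in the abstract, the following is a minimal Python sketch of how contact between tools and objects could be derived from binary segmentation masks by checking intersections and pixel distances. The mask class names (left_grasper, right_grasper, needle, thread), the distance threshold, and the helper functions are illustrative assumptions, not the paper's exact rules.

```python
# Minimal sketch: deriving frame-level surgical context from binary
# segmentation masks by checking intersections and pixel distances between
# tool and object masks. Class names and the contact threshold below are
# illustrative assumptions, not the rules used in the paper.
import numpy as np
from scipy.ndimage import distance_transform_edt


def masks_in_contact(mask_a: np.ndarray, mask_b: np.ndarray,
                     dist_thresh_px: float = 3.0) -> bool:
    """Return True if two binary masks overlap or lie within a pixel distance."""
    if np.logical_and(mask_a, mask_b).any():   # direct intersection
        return True
    if not mask_a.any() or not mask_b.any():   # one object not visible
        return False
    # Distance from every pixel to the nearest pixel of mask_b, sampled at the
    # pixels of mask_a, gives the closest approach between the two masks.
    dist_to_b = distance_transform_edt(~mask_b.astype(bool))
    return dist_to_b[mask_a.astype(bool)].min() <= dist_thresh_px


def frame_context(masks: dict) -> dict:
    """Map per-class masks (e.g., from a scene segmentation model) to coarse
    context states such as grasper-needle or grasper-thread contact."""
    return {
        "left_grasper_needle": masks_in_contact(masks["left_grasper"], masks["needle"]),
        "right_grasper_needle": masks_in_contact(masks["right_grasper"], masks["needle"]),
        "left_grasper_thread": masks_in_contact(masks["left_grasper"], masks["thread"]),
    }
```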
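
Likewise, the knowledge-based FSM stage could be sketched as a rule-based translation from per-frame context states to gesture segments. The transition rules and the mapping to JIGSAWS gesture codes (G1, G2, G6) shown here are a simplified subset assumed for illustration, not the full state machine described in the paper.

```python
# Minimal sketch of a knowledge-based FSM that turns a stream of per-frame
# context dictionaries (e.g., the output of frame_context above) into a
# gesture transcript of (start_frame, end_frame, gesture) segments.
# The rules and gesture codes are a simplified illustration; G1/G2/G6 follow
# the JIGSAWS Suturing gesture vocabulary.
def fsm_transcribe(contexts):
    transcript = []
    current, start, last = None, 0, -1
    for i, ctx in enumerate(contexts):
        last = i
        # Rule priority (assumed): needle held by right grasper,
        # then thread held by left grasper, otherwise keep/default state.
        if ctx.get("right_grasper_needle"):
            gesture = "G2"              # positioning needle
        elif ctx.get("left_grasper_thread"):
            gesture = "G6"              # pulling suture with left hand
        else:
            gesture = current or "G1"   # default: reaching for needle
        # Close the previous segment whenever the gesture label changes.
        if gesture != current:
            if current is not None:
                transcript.append((start, i - 1, current))
            current, start = gesture, i
    if current is not None:
        transcript.append((start, last, current))
    return transcript
```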