ArXiv Preprint
Automatic recognition of fine-grained surgical activities, called steps, is a
challenging but crucial task for intelligent intra-operative computer
assistance. The development of current vision-based activity recognition
methods relies heavily on large volumes of manually annotated data. Such data
are difficult and time-consuming to generate, and annotating them requires
domain-specific knowledge. In this work, we propose to use coarser and easier-to-annotate
activity labels, namely phases, as weak supervision to learn step recognition
with fewer step-annotated videos. We introduce a step-phase dependency loss to
exploit the weak supervision signal. We then employ a Single-Stage Temporal
Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained in an
end-to-end fashion from weakly annotated videos, for temporal activity
segmentation and recognition. We extensively evaluate the proposed method and
demonstrate its effectiveness on a large video dataset of 40 laparoscopic
gastric bypass procedures and on the public CATARACTS benchmark, which
contains 50 cataract surgeries.
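
The abstract does not state the form of the step-phase dependency loss. As a rough illustration only, one plausible way to turn a phase label into a training signal for a step classifier is to sum the predicted step probabilities over the steps belonging to the annotated phase and penalize the negative log of that mass. The sketch below assumes each step belongs to exactly one phase; the function name `phase_dependency_loss`, the `STEP_TO_PHASE` mapping, and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phase_dependency_loss(step_logits, phase_label, step_to_phase, eps=1e-12):
    """Negative log of the step-probability mass that falls inside the
    annotated phase. This is an assumed formulation for illustration,
    not the loss defined by the authors."""
    probs = softmax(step_logits)
    # Sum the probabilities of all steps mapped to the annotated phase.
    mass = sum(p for p, ph in zip(probs, step_to_phase) if ph == phase_label)
    return -math.log(mass + eps)


# Hypothetical hierarchy: steps 0-1 belong to phase 0, steps 2-3 to phase 1.
STEP_TO_PHASE = [0, 0, 1, 1]

# Frame whose step logits favor steps 0 and 1.
logits = [2.0, 1.0, -1.0, -2.0]
consistent = phase_dependency_loss(logits, 0, STEP_TO_PHASE)
inconsistent = phase_dependency_loss(logits, 1, STEP_TO_PHASE)
print(consistent < inconsistent)
```

Under these assumptions, a frame that carries only a phase label still penalizes step predictions that fall outside that phase, which is the kind of weak supervision signal the abstract describes.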
Sanat Ramesh, Diego Dall’Alba, Cristians Gonzalez, Tong Yu, Pietro Mascagni, Didier Mutter, Jacques Marescaux, Paolo Fiorini, Nicolas Padoy
2023-02-21