ArXiv Preprint
In digital pathology, Whole Slide Image (WSI) analysis is usually formulated
as a Multiple Instance Learning (MIL) problem. Although transformer-based
architectures have been used for WSI classification, these methods require
modifications to adapt them to the specific challenges of this type of image data.
Despite their power across domains, the reference transformer models of classical
Computer Vision (CV) and Natural Language Processing (NLP) are not used for
pathology slide analysis. In this work we demonstrate that standard, frozen,
text-pretrained transformer language models can be applied to WSI
classification. We propose SeqShort, a multi-head attention-based sequence
reduction input layer that summarizes each WSI into a fixed-length, short
sequence of instances. This allows us to reduce the computational cost of
self-attention over long sequences and to include positional information that is
unavailable in other MIL approaches. We demonstrate the effectiveness of our
method in the task of cancer subtype classification, without designing a
WSI-specific transformer or performing in-domain self-supervised pretraining,
while keeping the compute budget and the number of trainable parameters low.
Juan I. Pisula, Katarzyna Bozek
2022-11-14
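
To illustrate the sequence-reduction idea described in the abstract, the sketch below shows a minimal attention-based pooling layer in PyTorch: a small set of learned query tokens cross-attends to the bag of WSI patch features and yields a fixed-length summary sequence that is short enough to feed into a frozen, text-pretrained transformer. The class name, feature dimension, number of summary tokens, and head count are assumptions made for the example, not the paper's exact architecture or hyperparameters.

import torch
import torch.nn as nn
from typing import Optional

class SeqShortSketch(nn.Module):
    """Illustrative attention-based sequence reduction (not the authors' code).

    Maps a variable-length bag of instance features (N patches per WSI) to a
    fixed, short sequence of `num_summary` tokens via multi-head cross-attention
    against learned query vectors.
    """

    def __init__(self, feat_dim: int, num_summary: int = 128, num_heads: int = 8):
        super().__init__()
        # Learned query tokens: one per position of the shortened output sequence.
        self.queries = nn.Parameter(torch.randn(num_summary, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, instances: torch.Tensor,
                padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # instances: (B, N, D) patch features of a WSI bag; N can be very large.
        # padding_mask: (B, N) boolean, True where an instance is padding.
        b = instances.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, K, D)
        summary, _ = self.attn(q, instances, instances,
                               key_padding_mask=padding_mask)  # (B, K, D)
        return self.norm(summary)  # fixed-length sequence, independent of N

if __name__ == "__main__":
    # Toy usage: a bag of 5,000 patch embeddings is reduced to 128 tokens.
    reducer = SeqShortSketch(feat_dim=768, num_summary=128, num_heads=8)
    bag = torch.randn(1, 5000, 768)
    print(reducer(bag).shape)  # torch.Size([1, 128, 768])

Because the output length is fixed, positional embeddings can be added to the reduced sequence before the downstream transformer, and the quadratic self-attention cost is paid only over the short summary rather than over the full bag of patches.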