Learning medical visual representations directly from paired radiology
reports has become an emerging topic in representation learning. However,
existing medical image-text joint learning methods are limited to instance-wise
or local token-wise supervision, ignoring disease-level semantic correspondences. In
this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA)
framework for generalized medical visual representation learning by harnessing
the naturally exhibited semantic correspondences between medical images and
radiology reports at three different levels, i.e., pathological region-level,
instance-level, and disease-level. Specifically, we first employ an instance-wise
alignment module that maximizes the agreement between image-report
pairs. Further, for token-wise alignment, we introduce a bidirectional
cross-attention strategy to explicitly learn the matching between fine-grained
visual tokens and text tokens, followed by contrastive learning to align them.
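As a rough illustration of the first two objectives, the following PyTorch sketch pairs a symmetric InfoNCE loss over global image/report embeddings with a bidirectional cross-attention loss over visual and text tokens; the tensor shapes, temperatures, and simple dot-product attention are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the paper's code).
import torch
import torch.nn.functional as F

def instance_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over paired global image/report embeddings of shape (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature               # (B, B) cross-modal similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def token_alignment_loss(vis_tokens, txt_tokens, temperature=0.07):
    """Bidirectional cross-attention matching between visual tokens (B, Nv, D)
    and text tokens (B, Nt, D), followed by a token-level contrastive loss."""
    vis = F.normalize(vis_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    # text -> image: each text token attends over all visual tokens
    attn_t2v = torch.softmax(txt @ vis.transpose(1, 2) / 0.1, dim=-1)   # (B, Nt, Nv)
    txt_ctx = attn_t2v @ vis                                            # attended visual feature per text token
    # image -> text: each visual token attends over all text tokens
    attn_v2t = torch.softmax(vis @ txt.transpose(1, 2) / 0.1, dim=-1)   # (B, Nv, Nt)
    vis_ctx = attn_v2t @ txt

    def token_nce(queries, contexts):
        # pull each token toward its own attended context, against the other tokens of the same sample
        contexts = F.normalize(contexts, dim=-1)
        logits = queries @ contexts.transpose(1, 2) / temperature       # (B, N, N)
        n = queries.size(1)
        targets = torch.arange(n, device=queries.device).repeat(queries.size(0))
        return F.cross_entropy(logits.reshape(-1, n), targets)

    return 0.5 * (token_nce(txt, txt_ctx) + token_nce(vis, vis_ctx))
```

In practice the two losses above would be summed with the disease-level objective described next, with the loss weights and temperatures treated as hyperparameters.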
More importantly, to leverage the high-level inter-subject semantic correspondences
(e.g., shared disease semantics), we design a novel cross-modal disease-level
alignment paradigm to enforce the cross-modal cluster assignment consistency.
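The disease-level consistency can likewise be pictured with a small sketch: image and report embeddings are softly assigned to a shared set of learnable prototypes (acting as latent disease clusters), and each modality is trained to predict the assignment of its paired counterpart. The prototype count, Sinkhorn normalization, and temperatures below are illustrative assumptions, not the paper's reported settings.

```python
# Minimal sketch of cross-modal cluster assignment consistency (assumed settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, n_iters=3, eps=0.05):
    """Convert prototype scores (B, K) into balanced soft assignments."""
    q = torch.exp(scores / eps).t()                  # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)              # normalize each prototype row over samples
        q /= K
        q /= q.sum(dim=0, keepdim=True)              # normalize each sample column over prototypes
        q /= B
    return (q * B).t()                               # (B, K)

class DiseaseLevelAlignment(nn.Module):
    """Swapped-prediction consistency over prototypes shared by both modalities."""
    def __init__(self, dim=128, n_prototypes=500, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Linear(dim, n_prototypes, bias=False)
        self.temperature = temperature

    def forward(self, img_emb, txt_emb):
        # keep prototype vectors on the unit sphere, as is common for this objective
        with torch.no_grad():
            self.prototypes.weight.data = F.normalize(self.prototypes.weight.data, dim=-1)
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        img_scores = self.prototypes(img)            # (B, K) similarities to prototypes
        txt_scores = self.prototypes(txt)
        img_assign = sinkhorn(img_scores)            # soft cluster assignment of images
        txt_assign = sinkhorn(txt_scores)            # soft cluster assignment of reports
        # each modality predicts the cluster assignment of its paired counterpart
        loss_i = -(txt_assign * F.log_softmax(img_scores / self.temperature, dim=-1)).sum(-1).mean()
        loss_t = -(img_assign * F.log_softmax(txt_scores / self.temperature, dim=-1)).sum(-1).mean()
        return 0.5 * (loss_i + loss_t)
```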
Extensive experimental results on seven downstream medical image datasets
covering image classification, object detection, and semantic segmentation
tasks demonstrate the stable and superior performance of our framework.
Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, Lequan Yu