arXiv Preprint
Deep learning models can be applied successfully to real-world problems;
however, training most of these models requires massive amounts of data. Recent
methods combine language and vision, but unfortunately, they rely on datasets
that are not usually publicly available. Here we pave the way for further
research in the multimodal language-vision domain for radiology. In this paper,
we train a representation learning method that uses local and global
representations of language and vision through an attention mechanism, based on
the publicly available Indiana University Radiology Report (IU-RR) dataset.
Furthermore, we use the learned representations to diagnose five lung
pathologies: atelectasis, cardiomegaly, edema, pleural effusion, and
consolidation. Finally, we use both supervised and zero-shot classification to
extensively analyze the performance of the learned representations on the IU-RR
dataset. Average Area Under the Curve (AUC) is used to evaluate the accuracy of
the classifiers for classifying the five lung pathologies. The average AUC for
classifying the five lung pathologies on the IU-RR test set ranged from 0.85 to
0.87 using the different training datasets, namely CheXpert and CheXphoto.
These results compare favorably to other studies using IU-RR. Extensive
experiments confirm consistent results for classifying lung pathologies using
the multimodal global-local representations of language and vision information.
Nathan Hadjiyski, Ali Vosoughi, Axel Wismueller
2023-01-26
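
The abstract describes zero-shot classification of the five lung pathologies from learned image-text representations, evaluated with the average AUC. Below is a minimal sketch of how such an evaluation could look, assuming cosine similarity between image embeddings and one text-prompt embedding per pathology; the encoder outputs, embedding dimension, and scoring scheme are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: zero-shot pathology scoring via image-text similarity
# plus average AUC evaluation. Embeddings are placeholders standing in
# for the outputs of a pretrained vision-language encoder (assumption).
import numpy as np
from sklearn.metrics import roc_auc_score

PATHOLOGIES = ["atelectasis", "cardiomegaly", "edema",
               "pleural effusion", "consolidation"]

def zero_shot_scores(image_embeddings, text_embeddings):
    """Cosine similarity between each image and each pathology prompt."""
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return img @ txt.T  # shape: (num_images, num_pathologies)

def average_auc(scores, labels):
    """Mean AUC over the pathologies; labels are binary, same shape as scores."""
    aucs = [roc_auc_score(labels[:, i], scores[:, i]) for i in range(scores.shape[1])]
    return float(np.mean(aucs))

# Random placeholders standing in for real encoder outputs and test labels.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 512))               # 100 test images, 512-d embeddings
text_emb = rng.normal(size=(len(PATHOLOGIES), 512))   # one prompt embedding per pathology
labels = rng.integers(0, 2, size=(100, len(PATHOLOGIES)))

scores = zero_shot_scores(image_emb, text_emb)
print("Average AUC:", average_auc(scores, labels))
```

In a real setup, the placeholder arrays would be replaced by embeddings produced by the trained image and text encoders, and the labels by the IU-RR test annotations.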