ArXiv Preprint
Contrastive pretraining on parallel image-text data has attained great
success in vision-language processing (VLP), as exemplified by CLIP and related
methods. However, prior explorations have tended to focus on general web
domains. Biomedical images and text are substantially different, yet publicly
available datasets are small and skewed toward chest X-rays, severely limiting
progress. In this paper, we conducted by far the largest study on biomedical
VLP, using 15 million figure-caption pairs extracted from biomedical research
articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude
larger than existing biomedical image-text datasets such as MIMIC-CXR, and
spans a diverse range of biomedical images. The standard CLIP method proves
suboptimal for the biomedical domain, so we propose BiomedCLIP with
domain-specific adaptations tailored to biomedical VLP. We conducted extensive
experiments and ablation studies on standard biomedical imaging tasks from
retrieval to classification to visual question-answering (VQA). BiomedCLIP
established a new state of the art on a wide range of standard datasets,
substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even
outperformed state-of-the-art radiology-specific models such as BioViL on
radiology tasks such as RSNA pneumonia detection, highlighting
the utility of large-scale pretraining across all biomedical image types. We
will release our models at https://aka.ms/biomedclip to facilitate future
research in biomedical VLP.
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, Hoifung Poon
2023-03-02
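
The abstract builds on CLIP-style contrastive pretraining over paired images and captions. The sketch below illustrates the symmetric contrastive (InfoNCE) objective typically used in such pretraining; the encoders, batch size, and embedding dimension are illustrative placeholders, not the BiomedCLIP architecture or its PMC-15M training setup.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative only;
# not the authors' exact BiomedCLIP implementation).
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)   # placeholder image-encoder output
    txt = torch.randn(batch, dim)   # placeholder text-encoder output
    print(clip_contrastive_loss(img, txt).item())
```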