arXiv Preprint
Bram de Wilde, Anindo Saha, Richard P. G. ten Broek, Henkjan Huisman
2023-03-23
Diffusion-based models for text-to-image generation have gained immense
popularity due to recent advancements in efficiency, accessibility, and
quality. Although it is becoming increasingly feasible to perform inference
with these systems using consumer-grade GPUs, training them from scratch still
requires access to large datasets and significant computational resources. In
the case of medical image generation, the availability of large, publicly
accessible datasets that include text reports is limited due to legal and
ethical concerns. While training a diffusion model on a private dataset may
address this issue, it is not always feasible for institutions lacking the
necessary computational resources. This work demonstrates that pre-trained
Stable Diffusion models, originally trained on natural images, can be adapted
to various medical imaging modalities by training text embeddings with textual
inversion. In this study, we conducted experiments using datasets comprising
only 100 samples from three medical imaging modalities. Embeddings were
trained in a matter of hours, and the images generated with them retained
diagnostic relevance. The experiments were designed to achieve several
objectives. Firstly, we tuned the training and inference processes of textual
inversion, revealing that larger embeddings and more training examples are
required than is typical for natural images.
Secondly, we validated our approach by demonstrating a 2% increase in
diagnostic accuracy (AUC), from 0.78 to 0.80, for detecting prostate cancer
on MRI, a challenging multi-modal imaging task. Thirdly, we performed
simulations, interpolating between healthy and diseased states, combining
multiple pathologies, and inpainting, to show the flexibility of the
embeddings and control over disease appearance. Finally, the embeddings
trained in this
study are small (less than 1 MB), which facilitates easy sharing of medical
data with reduced privacy concerns.
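
As a rough, hypothetical sketch of how such a shared embedding could be used
(not the authors' exact pipeline), the Hugging Face diffusers library can load
a textual-inversion embedding into a pre-trained Stable Diffusion model. The
checkpoint file name, placeholder token, and prompt below are assumptions for
illustration only.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained Stable Diffusion pipeline (natural-image weights).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a learned textual-inversion embedding (a file under 1 MB) and bind
# it to a placeholder token; file name and token are hypothetical.
pipe.load_textual_inversion("learned_embeds.bin", token="<prostate-mri>")

# Generate a synthetic medical image conditioned on the learned concept.
image = pipe("an axial slice of <prostate-mri>").images[0]
image.save("synthetic_prostate_mri.png")
```

Because only the small embedding file changes hands, a recipient with the same
public base model can reproduce the learned medical concept without access to
the underlying patient data.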
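The interpolation experiment mentioned in the abstract can be sketched as
simple linear blending of two learned embedding vectors. The file names,
token names, and the {token: tensor} checkpoint layout below are assumptions
modeled on diffusers' textual-inversion format, not the authors' released
code.

```python
import torch

# Hypothetical sketch: linearly interpolate between a "healthy" and a
# "diseased" textual-inversion embedding. The two tensors must have the
# same shape (e.g., [num_vectors, embedding_dim]).
healthy = torch.load("healthy_learned_embeds.bin")["<healthy-prostate>"]
diseased = torch.load("diseased_learned_embeds.bin")["<diseased-prostate>"]

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    # alpha = 0 reproduces the healthy concept, alpha = 1 the diseased one;
    # intermediate values simulate a gradual change in disease appearance.
    blended = (1.0 - alpha) * healthy + alpha * diseased
    torch.save({"<blend>": blended}, f"blend_alpha_{alpha:.2f}.bin")
```

Each saved blend can then be loaded with load_textual_inversion, as in the
sketch above, and used in a prompt to generate images along the
healthy-to-diseased continuum.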