arXiv Preprint
Automatic radiology report generation (RRG) is essential for computer-aided
diagnosis and medication guidance, and it can relieve the heavy burden on
radiologists by generating medical reports automatically from
visual-linguistic data. However,
due to the spurious correlations within image-text data induced by visual and
linguistic biases, it is challenging to generate accurate reports that reliably
describe abnormalities. Moreover, the cross-modal confounder is usually
unobservable and difficult to eliminate explicitly. In this paper, we
mitigate cross-modal data bias for RRG from a new perspective, i.e.,
visual-linguistic causal intervention, and propose a novel Visual-Linguistic
Causal Intervention (VLCI) framework that consists of a visual deconfounding
module (VDM) and a linguistic deconfounding module (LDM), which implicitly
deconfound the visual-linguistic confounder via causal front-door
intervention. Specifically, the VDM explores and disentangles the visual
confounder from patch-based local and global features without object
detection, since universal clinical semantic extractors are unavailable.
Simultaneously, the LDM eliminates the linguistic confounder caused by salient
visual features and high-frequency context without constructing specific
dictionaries. Extensive experiments on the IU-Xray and MIMIC-CXR datasets show
that VLCI significantly outperforms state-of-the-art RRG methods. Source
code and models are available at https://github.com/WissingChen/VLCI.
Weixing Chen, Yang Liu, Ce Wang, Guanbin Li, Jiarui Zhu, Liang Lin
2023-03-16
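
To make the composition described in the abstract concrete, below is a minimal, hypothetical sketch of a two-module pipeline: a visual deconfounding module (VDM) acting on patch-based local and global features, followed by a linguistic deconfounding module (LDM) conditioning token features on the deconfounded visual context before report decoding. Only the module names VDM, LDM, and VLCI come from the paper; every layer choice, dimension, and the attention-based stand-in for front-door intervention is an illustrative assumption, not the authors' implementation.

```python
# Hedged, illustrative sketch of the VDM -> LDM composition named in the abstract.
# All architectural details below are assumptions for illustration only.
import torch
import torch.nn as nn


class VDM(nn.Module):
    """Visual deconfounding over patch-based local and global features (assumed form)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Assumption: cross-attention between global and local patch features
        # stands in for the mediation described in the abstract.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        global_feat = patch_feats.mean(dim=1, keepdim=True)     # (B, 1, D) global summary
        deconf, _ = self.attn(patch_feats, global_feat, global_feat)
        return self.norm(patch_feats + deconf)


class LDM(nn.Module):
    """Linguistic deconfounding of token features given visual context (assumed form)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        deconf, _ = self.attn(token_feats, visual_feats, visual_feats)
        return self.norm(token_feats + deconf)


class VLCI(nn.Module):
    """Toy composition: visual deconfounding, then linguistically deconfounded decoding."""

    def __init__(self, vocab_size: int = 1000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.vdm = VDM(dim)
        self.ldm = LDM(dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patch_feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        v = self.vdm(patch_feats)              # deconfounded visual features
        t = self.ldm(self.embed(tokens), v)    # deconfounded token features
        return self.lm_head(t)                 # per-token logits for the report


if __name__ == "__main__":
    model = VLCI()
    logits = model(torch.randn(2, 49, 512), torch.randint(0, 1000, (2, 20)))
    print(logits.shape)  # torch.Size([2, 20, 1000])
```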