arXiv Preprint
This study presents three de-identified large medical text datasets, named
DISCHARGE, ECHO, and RADIOLOGY, which contain 50K, 16K, and 378K pairs of
reports and summaries derived from MIMIC-III, respectively. We implement
strong baselines for automated abstractive summarization on the proposed
datasets with pre-trained encoder-decoder language models, including
BERT2BERT, T5-large, and BART. Furthermore, building on the BART model, we
leverage summaries sampled from the training set as prior knowledge
guidance: the encoder produces additional contextual representations of the
guidance, which are then used to enhance the decoder's representations. The
experimental results confirm that the proposed method improves ROUGE scores
and BERTScore, outperforming the larger T5-large model.
Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang
2023-02-08
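
As a rough illustration of the guidance idea described in the abstract, the
following Python sketch (using Hugging Face Transformers) encodes a summary
sampled from the training set with the BART encoder and lets the decoder
cross-attend to the concatenated guidance and source representations. This
is not the authors' released implementation: the variable names, the example
texts, and the concatenation scheme are all illustrative assumptions about
one plausible way to realize the mechanism.

    # Hedged sketch of guidance-conditioned generation with BART.
    # Assumption: guidance states are simply concatenated with source
    # states so the decoder can cross-attend to both.
    import torch
    from transformers import BartTokenizer, BartForConditionalGeneration
    from transformers.modeling_outputs import BaseModelOutput

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
    model.eval()

    # Hypothetical inputs: a source report and a summary sampled from
    # the training set, used here as prior knowledge guidance.
    report = "FINDINGS: The cardiac silhouette is mildly enlarged. ..."
    guidance = "Mild cardiomegaly without acute pulmonary process."

    src = tokenizer(report, return_tensors="pt", truncation=True)
    gui = tokenizer(guidance, return_tensors="pt", truncation=True)

    encoder = model.get_encoder()
    with torch.no_grad():
        # Encode source and guidance separately with the shared encoder.
        src_states = encoder(**src).last_hidden_state  # (1, S, d)
        gui_states = encoder(**gui).last_hidden_state  # (1, G, d)

    # Concatenate guidance and source representations along the sequence
    # dimension so decoding is conditioned on both.
    enc_states = torch.cat([gui_states, src_states], dim=1)
    enc_mask = torch.cat([gui.attention_mask, src.attention_mask], dim=1)

    summary_ids = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=enc_states),
        attention_mask=enc_mask,
        max_length=64,
        num_beams=4,
    )
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))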