ArXiv Preprint
Objective: The generalizability of clinical large language models is often overlooked during model development. This study evaluated the generalizability of BERT-based clinical NLP models across different clinical settings through a breast cancer phenotype extraction task.
Materials and Methods: Two clinical corpora of breast cancer patients were collected from the electronic health records of the University of Minnesota (UMN) and the Mayo Clinic (MC), and annotated following the same guideline. We developed three types of NLP models (i.e., conditional random field, bi-directional long short-term memory, and CancerBERT) to extract cancer phenotypes from clinical texts. The models were evaluated for their generalizability on different test sets under different learning strategies (model transfer vs. locally trained). The entity coverage score was assessed, along with its association with model performance.
Results: We manually annotated 200 and 161 clinical documents at UMN and MC, respectively. The corpora of the two institutes showed higher similarity between the target entities than between the overall corpora. The CancerBERT models achieved the best performance on the independent test sets from the two institutes and on the permutation test set. The CancerBERT model developed at one institute and further fine-tuned at the other achieved performance comparable to the model developed on local data (micro-F1: 0.925 vs. 0.932).
Conclusions: The results indicate that the CancerBERT model has the best learning ability and generalizability among the three types of clinical NLP models. The generalizability of the models was found to be correlated with the similarity of the target entities between the corpora.
Sicheng Zhou, Nan Wang, Liwei Wang, Ju Sun, Anne Blaes, Hongfang Liu, Rui Zhang
2023-03-15