ArXiv Preprint
Objective: We aim to develop an open-source natural language processing (NLP)
package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models
to extract social determinants of health (SDoH) for cancer patients, examine
the generalizability of SODA to a new disease domain (i.e., opioid use), and
evaluate the extraction rate of SDoH using cancer populations.
Methods: We identified SDoH categories and attributes and developed an SDoH
corpus using clinical notes from a general cancer cohort. We compared four
transformer-based NLP models to extract SDoH, examined the generalizability of
NLP models to a cohort of patients prescribed opioids, and explored
customization strategies to improve performance. We applied the best NLP model
to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804),
and colorectal cancer (n=6,240) cohorts.
Results and Conclusion: We developed a corpus of 629 cancer patients' notes
with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH.
The Bidirectional Encoder Representations from Transformers (BERT) model
achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH
concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts.
Fine-tuning the NLP models using new annotations from opioid use patients
improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The
extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH
could be extracted from >70% of cancer patients, whereas 9 SDoH had a low
extraction rate (<70% of cancer patients). The SODA package with pre-trained
transformer models is publicly available at
https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
Zehao Yu, Xi Yang, Chong Dang, Prakash Adekkanattu, Braja Gopal Patra, Yifan Peng, Jyotishman Pathak, Debbie L. Wilson, Ching-Yuan Chang, Wei-Hsuan Lo-Ciganic, Thomas J. George, William R. Hogan, Yi Guo, Jiang Bian, Yonghui Wu
2022-12-06