In Journal of Biomedical Informatics ; h5-index 55.0
OBJECTIVE : Artificial intelligence in healthcare increasingly relies on relations in knowledge graphs for algorithm development, yet many important relations are poorly covered by existing knowledge graphs. We aim to develop a novel long-distance relation extraction algorithm that leverages article section structure and is trained on bootstrapped, noisy data to identify relations important for diagnosis, including "may cause", "may be caused by", and "differential diagnosis".
METHODS : Known relations were extracted from semistructured web pages and a relational database and paired with sentences containing the corresponding medical concepts to form training data. The conventional sentence form, in which both concepts occur in the sentence, was extended to allow one concept to appear in the title instead. An attention mechanism was applied to reduce the effect of noisily labeled sentences. A section structure embedding was added to provide additional context for relation expressions. Graph information was further incorporated into the model to differentiate the target relations, whose expressions were often similar and interwoven.
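The pairing step described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sentence` type, the `mentions` helper, the `pair_sentences` function, and the naive case-insensitive substring matching are all assumptions made for illustration, and a real system would rely on proper medical concept recognition. The sketch only shows how allowing one concept to sit in the title can enlarge the set of noisily labeled training sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Hypothetical container: a sentence plus the title it appears under."""
    title: str
    text: str


def mentions(concept: str, text: str) -> bool:
    # Naive case-insensitive substring match (an assumption; a real
    # pipeline would use a medical concept recognizer instead).
    return concept.lower() in text.lower()


def pair_sentences(known_relations, sentences, allow_title=True):
    """Pair known (head, relation, tail) triples with candidate sentences.

    Conventional form: both concepts occur in the sentence text.
    Extended form (allow_title=True): one concept may occur in the title
    while the other occurs in the sentence text.
    The resulting labels are noisy, since co-occurrence does not guarantee
    that the sentence actually expresses the relation.
    """
    examples = []
    for head, relation, tail in known_relations:
        for s in sentences:
            both_in_text = mentions(head, s.text) and mentions(tail, s.text)
            title_plus_text = allow_title and (
                (mentions(head, s.title) and mentions(tail, s.text))
                or (mentions(tail, s.title) and mentions(head, s.text))
            )
            if both_in_text or title_plus_text:
                examples.append((head, relation, tail, s))
    return examples


# Toy data, invented purely for illustration.
known = [("pneumonia", "may cause", "fever")]
corpus = [
    Sentence("Pneumonia", "Common symptoms include fever and cough."),
    Sentence("Pneumonia", "Pneumonia often presents with fever."),
    Sentence("Asthma", "Wheezing is typical."),
]

conventional = pair_sentences(known, corpus, allow_title=False)
extended = pair_sentences(known, corpus, allow_title=True)
print(len(conventional), len(extended))
```

In this toy corpus the first sentence is picked up only by the extended form, because "pneumonia" occurs in its title rather than in its text.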
RESULTS : Compared with the conventional form, the extended sentence form yielded 1.75 times as many relations and 2.17 times as many sentences. Each component of the proposed model contributed to its accuracy. Overall, the positive sample accuracy of the proposed model was 9 percentage points higher than that of baseline deep learning models and 13 percentage points higher than that of naïve Bayes and support vector machines.
CONCLUSION : Our bootstrap data preparation method and the extended sentence form can produce a large training dataset to support algorithm development and data mining efforts. Section structure embedding and graph information significantly increased prediction accuracy.
Lin Yucong, Li Yang, Lu Keming, Ma Cheng, Zhao Peng, Gao Daiqi, Fan Zihao, Cheng Zijie, Wang Zheyu, Yu Sheng
article structure embedding, distant supervision, graph information, relation extraction