
In Methods of information in medicine

OBJECTIVE :  This study aimed to develop a semi-automated process to convert legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.

MATERIALS AND METHODS :  Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as machine learning features. For this purpose, we used the following methods: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble that combines the four) to identify the one that could generate the best prediction model.
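
As an illustration of the Gestalt pattern matching step, the following is a minimal sketch of how string similarity could turn a legacy variable's name and label into numeric features. It assumes Python's standard-library `difflib.SequenceMatcher`, which implements a Ratcliff/Obershelp-style (Gestalt) matcher. The `SDTM_TARGETS` subset, the `normalize` rule, and the function names are hypothetical and are not taken from the study.

```python
from difflib import SequenceMatcher

# Hypothetical, illustrative subset of SDTM target variables and their labels.
SDTM_TARGETS = {
    "USUBJID": "Unique Subject Identifier",
    "AESTDTC": "Start Date/Time of Adverse Event",
    "AETERM": "Reported Term for the Adverse Event",
    "SEX": "Sex",
}


def normalize(text):
    """Assumed normalization: lowercase, underscores to spaces, collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def gestalt_similarity(a, b):
    """Similarity in [0, 1] from difflib's Gestalt-style sequence matcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def similarity_features(legacy_name, legacy_label):
    """Build a flat feature vector: for each target, name similarity then label similarity."""
    features = []
    for target_name, target_label in SDTM_TARGETS.items():
        features.append(gestalt_similarity(legacy_name, target_name))
        features.append(gestalt_similarity(legacy_label, target_label))
    return features


if __name__ == "__main__":
    print(similarity_features("SUBJ_ID", "Subject identifier"))
```

A vector of this kind could be passed to any of the classifiers listed above. The Doc2vec-based features would need a separately trained embedding model and are not shown here.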

RESULTS :  The accuracy rate was highest for the neural network, and the distribution of its prediction probabilities also showed a separation between correct and incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
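
One way such a separation in prediction probabilities could support human verification is to accept high-confidence predictions automatically and send the rest to a reviewer. The sketch below assumes this thresholding scheme; the abstract does not describe the actual verification procedure, and the `triage` function, the 0.9 threshold, and the example probabilities are all hypothetical.

```python
def triage(predictions, threshold=0.9):
    """Split predictions into auto-accepted and human-review lists.

    predictions maps each legacy variable to a dict of
    {candidate SDTM variable: predicted probability}.
    The threshold is an assumed cut-off, not a value from the study.
    """
    accepted, review = [], []
    for variable, probabilities in predictions.items():
        # Pick the candidate with the highest predicted probability.
        target, confidence = max(probabilities.items(), key=lambda kv: kv[1])
        bucket = accepted if confidence >= threshold else review
        bucket.append((variable, target, confidence))
    return accepted, review


if __name__ == "__main__":
    # Made-up probabilities for illustration only.
    example = {
        "subj_id": {"USUBJID": 0.97, "SEX": 0.03},
        "onset": {"AESTDTC": 0.55, "AETERM": 0.45},
    }
    accepted, review = triage(example)
    print("accepted:", accepted)
    print("needs review:", review)
```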

CONCLUSION :  By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.

Oda Takuma, Chiu Shih-Wei, Yamaguchi Takuhiro

2021-Jul-08