
General

Accurate Blood-Based Diagnostic Biosignatures for Alzheimer's Disease via Automated Machine Learning.

In Journal of clinical medicine

Alzheimer's disease (AD) is the most common form of neurodegenerative dementia and its timely diagnosis remains a major challenge in biomarker discovery. In the present study, we analyzed publicly available high-throughput low-sample -omics datasets from studies in AD blood, by the AutoML technology Just Add Data Bio (JADBIO), to construct accurate predictive models for use as diagnostic biosignatures. Considering data from AD patients and age-sex matched cognitively healthy individuals, we produced three best performing diagnostic biosignatures specific for the presence of AD: A. A 506-feature transcriptomic dataset from 48 AD and 22 controls led to a miRNA-based biosignature via Support Vector Machines with three miRNA predictors (AUC 0.975 (0.906, 1.000)), B. A 38,327-feature transcriptomic dataset from 134 AD and 100 controls led to six mRNA-based statistically equivalent signatures via Classification Random Forests with 25 mRNA predictors (AUC 0.846 (0.778, 0.905)) and C. A 9483-feature proteomic dataset from 25 AD and 37 controls led to a protein-based biosignature via Ridge Logistic Regression with seven protein predictors (AUC 0.921 (0.849, 0.972)). These performance metrics were also validated through the JADBIO pipeline confirming stability. In conclusion, using the automated machine learning tool JADBIO, we produced accurate predictive biosignatures extrapolating available low sample -omics data. These results offer options for minimally invasive blood-based diagnostic tests for AD, awaiting clinical validation based on respective laboratory assays. They also highlight the value of AutoML in biomarker discovery.

Karaglani Makrina, Gourlia Krystallia, Tsamardinos Ioannis, Chatzaki Ekaterini


Alzheimer’s disease, blood, classifier, machine learning, predictive model
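The AUC values with intervals quoted in the abstract above can be illustrated with a small sketch. It computes AUC as the probability that a randomly chosen AD sample scores higher than a randomly chosen control, and attaches a percentile bootstrap interval. This is a generic illustration only: the function names, the toy scores, and the percentile bootstrap are assumptions, not the procedure JADBIO itself uses.

```python
import random


def auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score); ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def bootstrap_ci(scores_pos, scores_neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed, generic method)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each class with replacement, keeping class sizes fixed.
        pos = [rng.choice(scores_pos) for _ in scores_pos]
        neg = [rng.choice(scores_neg) for _ in scores_neg]
        stats.append(auc(pos, neg))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical classifier scores for AD cases and controls.
ad_scores = [0.9, 0.8, 0.4]
control_scores = [0.5, 0.3]
print(auc(ad_scores, control_scores))
print(bootstrap_ci(ad_scores, control_scores))
```

With so few samples the bootstrap interval is wide, which is the same small-sample caveat that applies to the low-sample datasets described in the abstract.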

Pathology

De-identifying free text of Japanese electronic health records.

In Journal of biomedical semantics ; h5-index 23.0

BACKGROUND : Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.

RESULTS : Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performances of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method was the best performing, except for the MedNLP dataset, where the rule-based method was best. The LSTM-based method achieved a good score of 83.07 points for this MedNLP dataset, which differs by 1.16 points from the best score obtained using the rule-based method. Results suggest that LSTM adapted well to different characteristics of our datasets. Our LSTM-based method performed better than our CRF-based method, yielding a 7.41 point F1-score, when applied to our Pathology Report dataset. This report is the first study applying this LSTM-based method to any de-identification task of a Japanese EHR.

CONCLUSIONS : Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.

Kajiyama Kohei, Horiguchi Hiromasa, Okumura Takashi, Morita Mizuki, Kano Yoshinobu


De-identification, Electronic health records, Japanese language
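The F1-scores in the abstract above combine precision and recall over the tagged entities. A minimal sketch of an entity-level F1 computation follows, assuming each entity is represented as a (start, end, label) tuple and that only exact matches count. The representation, the function name, and the toy spans are illustrative assumptions, not the evaluation script used in the study.

```python
def entity_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, label) entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations using the five tag types named in the abstract.
gold = [(0, 2, "person"), (5, 7, "hospital"), (9, 10, "age")]
predicted = [(0, 2, "person"), (5, 7, "time"), (9, 10, "age"), (12, 13, "sex")]

# Two exact matches: precision 2/4, recall 2/3, so F1 = 4/7.
print(round(100 * entity_f1(gold, predicted), 2))  # 57.14
```

Multiplying by 100 gives a score in "points", the unit the abstract uses.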

General

Utilizing imbalanced electronic health records to predict acute kidney injury by ensemble learning and time series model.

In BMC medical informatics and decision making ; h5-index 38.0

BACKGROUND : Acute Kidney Injury (AKI) is a common complication among Intensive Care Unit (ICU) patients, marked by high cost, high morbidity and high mortality. As the early prediction of AKI is critical for patients' outcomes and data mining is a powerful prediction tool, many AKI prediction models based on machine learning methods have been proposed. Our work is motivated by the fact that the incidence of AKI is a changing temporal sequence affected by the joint action of patients' daily drug combinations and their physiological indexes. However, most existing models have not considered such a temporal correlation. In addition, because clinical data are sparse, high-dimensional and highly imbalanced, it is hard to achieve ideal performance.

METHODS : We develop a fast, simple and less-costly model based on an ensemble learning algorithm, named Ensemble Time Series Model (ETSM). Besides benefiting from vital signs and laboratory results as explicit indicators, ETSM explores the effect of drug combinations as possible implicit indicators for the AKI prediction. The model transforms temporal medication information into a multidimensional vector to consider and measure drug cumulative effects that may cause AKI.

RESULTS : We compare ETSM with state-of-the-art models on ICUC and MIMIC III datasets. On the basis of the experimental results, our model obtains satisfactory performance (ICUC: AUC 24 hours ahead: 0.81, 48 hours ahead: 0.78; MIMIC III: AUC 24 hours ahead: 0.95, 48 hours ahead: 0.95). Meanwhile, we compare the effects of different sampling and feature generation methods on the model performance. In the ablation study, we validate that medication information improves model performance (24 hours ahead: AUC increased from 0.74 to 0.81). We also find that the model's performance is closely related to the balanced level of the derivation dataset. The optimal ratio of major class size to minor class size for the model is found for AKI prediction.

CONCLUSIONS : ETSM is an effective method for the early prediction of AKI. The model verifies that AKI incidence is related to the clinical medication. In comparison with other prediction methods, ETSM provides comparable performance results and better interpretability.

Wang Yuan, Wei Yake, Yang Hao, Li Jingwei, Zhou Yubo, Wu Qin


Acute kidney injury (AKI), Drug combination, ETSM, Ensemble learning, Prediction
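The METHODS paragraph above describes turning temporal medication records into a multidimensional vector that reflects cumulative drug effects. One way such an encoding could look is sketched below, with one dimension per drug and an exponential decay so that recent days weigh more. The decay factor, the function name, and the drug names are hypothetical; the abstract does not specify how ETSM builds its vector.

```python
def medication_vector(daily_drugs, vocabulary, decay=0.5):
    """Encode a sequence of daily drug sets as one decayed-count vector.

    daily_drugs: list of sets of drug names, oldest day first.
    vocabulary:  ordered list of drug names, one vector dimension each.
    decay:       assumed per-day weight applied to earlier exposure.
    """
    index = {drug: i for i, drug in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for drugs in daily_drugs:
        # Earlier exposure fades by one decay step each day.
        vector = [value * decay for value in vector]
        for drug in drugs:
            if drug in index:
                vector[index[drug]] += 1.0
    return vector


vocabulary = ["vancomycin", "furosemide", "ibuprofen"]
days = [{"vancomycin"}, {"vancomycin", "furosemide"}, {"furosemide"}]
print(medication_vector(days, vocabulary))  # [0.75, 1.5, 0.0]
```

A vector like this could be concatenated with vital signs and laboratory results before being passed to an ensemble classifier.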

General

Viral pandemic preparedness: A pluripotent stem cell-based machine-learning platform for simulating SARS-CoV-2 infection to enable drug discovery and repurposing.

In Stem cells translational medicine

Infection with the SARS-CoV-2 virus has rapidly become a global pandemic for which we were not prepared. Several clinical trials using previously approved drugs and drug combinations are urgently underway to improve our current situation. Unfortunately, a vaccine option is optimistically at least a year away. It is imperative that for future viral pandemic preparedness, we have a rapid screening technology for drug discovery and repurposing. The primary purpose of this research project was to evaluate the DeepNEU stem-cell based platform by creating and validating computer simulations of artificial lung cells infected with SARS-CoV-2 to enable the rapid identification of antiviral therapeutic targets and drug repurposing. The data generated from this project indicate that (a) human alveolar type lung cells can be simulated by DeepNEU (v5.0), (b) these simulated cells can then be infected with simulated SARS-CoV-2 virus, (c) the unsupervised learning system performed well in all simulations based on available published wet lab data, and (d) the platform identified potentially effective anti-SARS-CoV-2 combinations of known drugs for urgent clinical study. The data also suggest that DeepNEU can identify potential therapeutic targets for expedited vaccine development. We conclude that based on published data plus current DeepNEU results, continued development of the DeepNEU platform will improve our preparedness for and response to future viral outbreaks. This can be achieved through rapid identification of potential therapeutic options for clinical testing as soon as the viral genome has been confirmed.

Esmail Sally, Danter Wayne R


DeepNEU, SARS-CoV-2, antiviral, drug discovery and repurposing, pandemic preparedness, unsupervised learning

Public Health

Transcription factor expression as a predictor of colon cancer prognosis: a machine learning practice.

In BMC medical genomics

BACKGROUND : Colon cancer is one of the leading causes of cancer deaths in the USA and around the world. Molecular-level characteristics, such as gene expression levels and mutations, may provide valuable information for precision treatment beyond pathological indicators. Transcription factors function as critical regulators in all aspects of cell life, but transcription factor-based biomarkers for colon cancer prognosis are still rare and needed.

METHODS : We implemented an innovative process to select transcription factor variables and evaluate their prognostic prediction power by combining the Cox PH model with the random forest algorithm. We picked the five top-ranked transcription factors and built a prediction model using Cox PH regression. Using Kaplan-Meier analysis, we validated our predictive model on four independent publicly available datasets (GSE39582, GSE17536, GSE37892, and GSE17537) from the GEO database, consisting of 925 colon cancer patients.

RESULTS : A five-transcription-factor predictive model for colon cancer prognosis was developed using TCGA colon cancer patient data. The five transcription factors identified for the predictive model are HOXC9, ZNF556, HEYL, HOXC4 and HOXC6. The prediction power of the model was validated with four GEO datasets consisting of 1584 patient samples. Kaplan-Meier curves and log-rank tests were computed on both training and validation datasets, and a clear difference in overall survival time between the predicted low-risk and high-risk groups was observed. Gene set enrichment analysis was performed to further investigate the difference between low-risk and high-risk groups at the gene pathway level, and the biological meaning was interpreted. Overall, our results indicate that our prediction model has strong prediction power for colon cancer prognosis.

CONCLUSIONS : Transcription factors can be used to construct colon cancer prognostic signatures with strong prediction power. The variable selection process used in this study has the potential to be implemented in the prognostic signature discovery of other cancer types. Our five-TF predictive model would help with understanding the hidden relationship between colon cancer patient survival and transcription factor activities. It will also provide more insights into the precision treatment of colon cancer patients from a genomic information perspective.

Liu Jiannan, Dong Chuanpeng, Jiang Guanglong, Lu Xiaoyu, Liu Yunlong, Wu Huanmei


Cancer prognosis, Colon cancer, Machine learning, Transcription factor
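The workflow in the abstract above (a linear risk score from a Cox PH model, a split into low-risk and high-risk groups, then Kaplan-Meier curves per group) can be sketched in a few lines. The coefficients, the median cutoff, and the toy data below are assumptions for illustration; the abstract does not report the fitted coefficients or the cutoff rule.

```python
from statistics import median


def risk_groups(expression_rows, coefficients):
    """Label each patient by a linear risk score, split at the median (assumed)."""
    scores = [sum(b * x for b, x in zip(coefficients, row))
              for row in expression_rows]
    cutoff = median(scores)
    return ["high" if score > cutoff else "low" for score in scores]


def kaplan_meier(times, events):
    """Return [(time, survival)] at each time with at least one event.

    events[i] is 1 if the event (death) was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical two-gene expression rows and coefficients.
print(risk_groups([[1, 0], [0, 1], [2, 2], [0, 0]], [0.5, -0.2]))
# Hypothetical follow-up times (months) and event indicators.
print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

In practice the two groups' curves would be compared with a log-rank test, as the RESULTS paragraph describes.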

Cardiology

A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity.

In PloS one ; h5-index 176.0

Worldwide, testing capacity for SARS-CoV-2 is limited and bottlenecks in the scale-up of polymerase chain reaction (PCR)-based testing exist. Our aim was to develop and evaluate a machine learning algorithm to diagnose COVID-19 in the inpatient setting. The algorithm was based on basic demographic and laboratory features to serve as a screening tool at hospitals where testing is scarce or unavailable. We used retrospectively collected data from the UCLA Health System in Los Angeles, California. We included all emergency room or inpatient cases receiving SARS-CoV-2 PCR testing who also had a set of ancillary laboratory features (n = 1,455) between 1 March 2020 and 24 May 2020. We tested seven machine learning models and used a combination of those models for the final diagnostic classification. In the test set (n = 392), our combined model had an area under the receiver operating characteristic curve of 0.91 (95% confidence interval 0.87-0.96). The model achieved a sensitivity of 0.93 (95% CI 0.85-0.98) and a specificity of 0.64 (95% CI 0.58-0.69). We found that our machine learning algorithm had excellent diagnostic metrics compared to SARS-CoV-2 PCR. This ensemble machine learning algorithm to diagnose COVID-19 has the potential to be used as a screening tool in hospital settings where PCR testing is scarce or unavailable.

Goodman-Meza David, Rudas Akos, Chiang Jeffrey N, Adamson Paul C, Ebinger Joseph, Sun Nancy, Botting Patrick, Fulcher Jennifer A, Saab Faysal G, Brook Rachel, Eskin Eleazar, An Ulzee, Kordi Misagh, Jew Brandon, Balliu Brunilda, Chen Zeyuan, Hill Brian L, Rahmani Elior, Halperin Eran, Manuel Vladimir
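
The sensitivity and specificity reported in the abstract above come from comparing the model's calls against PCR results. A minimal sketch of that calculation, together with one simple way to combine several models' outputs, follows. Averaging predicted probabilities is an assumption made for illustration; the abstract says only that a combination of the seven models was used, and all values below are toy data.

```python
from statistics import mean


def ensemble_probability(model_probabilities):
    """Combine per-model probabilities by simple averaging (assumed rule)."""
    return mean(model_probabilities)


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = PCR positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical outputs of three models for one patient.
print(ensemble_probability([0.9, 0.6, 0.3]))

# Hypothetical PCR results and model calls for seven patients.
pcr = [1, 1, 1, 0, 0, 0, 0]
calls = [1, 1, 0, 1, 0, 0, 0]
print(sensitivity_specificity(pcr, calls))
```

A screening tool typically favors a low decision threshold, which raises sensitivity at the cost of specificity, consistent with the 0.93 and 0.64 figures in the abstract.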