
Radiology

Liver imaging features by convolutional neural network to predict the metachronous liver metastasis in stage I-III colorectal cancer patients based on preoperative abdominal CT scan.

In BMC bioinformatics

BACKGROUND : Applying deep learning to medical images has made a large amount of previously undecoded information usable in clinical research. Most work, however, has focused on the performance of prediction models for disease-related entities rather than on the clinical implications of the features themselves. Here we used a convolutional neural network (CNN) to analyze liver imaging features in abdominal CT images collected from 2019 patients with stage I-III colorectal cancer (CRC), in order to elucidate their clinical implications from an oncological perspective.

RESULTS : The CNN generated imaging features from the liver parenchyma, and the features were reduced in dimension by principal component analysis. We designed multiple prediction models for 5-year metachronous liver metastasis (5YLM) using combinations of clinical variables (age, sex, T stage, N stage) and the top principal components (PCs), with logistic regression as the classifier. The model using "1st PC (PC1) + clinical information" had the highest performance in predicting 5YLM (mean AUC = 0.747), compared with the model using clinical features alone (mean AUC = 0.709). PC1 was independently associated with 5YLM in multivariate analysis (beta = -3.831, P < 0.001). For 5-year mortality, PC1 did not improve on the model with clinical features alone. Kaplan-Meier plots showed a significant difference between the low-PC1 and high-PC1 groups: 5YLM-free survival was 89.6% in the low-PC1 group and 95.9% in the high-PC1 group. In addition, PC1 correlated significantly with sex, body mass index, alcohol consumption, and fatty liver status.
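To make the AUC comparison above concrete, here is a minimal Python sketch of the rank-based AUC, computed from scratch. The labels and risk scores are invented toy values, not study data, and `roc_auc` is a hypothetical helper name. The study's actual pipeline (CNN features, PCA, logistic regression) is not reproduced here.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = metastasis within 5 years, 0 = none.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical predicted risks from two models on the same patients.
clinical_only = [0.60, 0.30, 0.55, 0.50, 0.20, 0.35, 0.10, 0.40]
clinical_plus_pc1 = [0.70, 0.45, 0.65, 0.50, 0.15, 0.30, 0.10, 0.40]

auc_clinical = roc_auc(labels, clinical_only)
auc_combined = roc_auc(labels, clinical_plus_pc1)
print(f"clinical only:  AUC = {auc_clinical:.3f}")
print(f"clinical + PC1: AUC = {auc_combined:.3f}")
```

On these toy scores the second model ranks one more positive/negative pair correctly, so its AUC is higher. A mean AUC such as those reported above would presumably be an average of this quantity over repeated evaluations, although the abstract does not say how the averaging was done.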

CONCLUSION : Combining the imaging features with clinical information improved performance over the standardized prediction model that used clinical information alone. Liver imaging features generated by a CNN may therefore have the potential to predict liver metastasis. These results suggest that even when no liver metastasis is present at the time of the primary colectomy, liver imaging features can carry characteristics that are predictive of metachronous liver metastasis.

Lee Sangwoo, Choe Eun Kyung, Kim So Yeon, Kim Hua Sun, Park Kyu Joo, Kim Dokyoon


Artificial intelligence, Colorectal cancer, Convolutional neural network, Radiomics

General

CORENup: a combination of convolutional and recurrent deep neural networks for nucleosome positioning identification.

In BMC bioinformatics

BACKGROUND : Nucleosomes package DNA within the nucleus of eukaryotic cells and regulate its transcription. Several studies indicate that nucleosome positioning is determined by the combined effects of several factors, including DNA sequence organization. Notably, nucleosomes have been successfully identified on a genomic scale by computational methods that take the DNA sequence as input.

RESULTS : In this work, we propose CORENup, a deep learning model for nucleosome identification. CORENup takes a DNA sequence as input in one-hot representation and combines, in parallel, a fully convolutional neural network and a recurrent layer. These two parallel branches are devoted to capturing both non-periodic and periodic features of the DNA string. A dense layer then combines their outputs to give the final classification.
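The one-hot input representation mentioned above can be sketched in a few lines of Python. This is an illustrative encoding only: the A/C/G/T column order and the all-zero row for unknown bases are assumptions, and `one_hot` is a hypothetical helper rather than part of CORENup.

```python
ALPHABET = "ACGT"  # assumed column order; the abstract does not specify one


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one per base.

    Bases outside ALPHABET (for example 'N') become an all-zero row,
    which is an assumption made for this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A sequence of length L becomes an L x 4 matrix, which is the kind of input that both a convolutional branch and a recurrent branch can consume position by position.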

CONCLUSIONS : Results computed on public data sets from different organisms show that CORENup is a state-of-the-art method for nucleosome positioning identification based on a deep neural network architecture. The comparisons were carried out on two groups of datasets currently adopted by the best-performing methods, and CORENup showed top performance in terms of both classification metrics and elapsed computation time.

Amato Domenico, Lo Bosco Giosuè, Rizzo Riccardo


Deep learning networks, Epigenetic, Nucleosome classification, Recurrent neural networks

General

Computationally identifying hot spots in protein-DNA binding interfaces using an ensemble approach.

In BMC bioinformatics

BACKGROUND : Protein-DNA interaction governs a large number of cellular processes, and it can be altered by a small fraction of interface residues, i.e., the so-called hot spots, which account for most of the interface binding free energy. Accurate prediction of hot spots is critical to understand the principle of protein-DNA interactions. There are already some computational methods that can accurately and efficiently predict a large number of hot residues. However, the insufficiency of experimentally validated hot-spot residues in protein-DNA complexes and the low diversity of the employed features limit the performance of existing methods.

RESULTS : Here, we report a new computational method for effectively predicting hot spots in protein-DNA binding interfaces. This method, called PreHots (short for Predicting Hotspots), adopts an ensemble stacking classifier that integrates different machine learning classifiers to generate a robust model with 19 features selected by a sequential backward feature selection algorithm. To this end, we constructed two new and reliable datasets (one benchmark for model training and one independent dataset for validation), which together comprise 123 hot spots and 137 non-hot spots from 89 protein-DNA complexes. The data were manually collected from the literature and existing databases with a strict process of redundancy removal. Our method achieves a sensitivity of 0.813 and an AUC score of 0.868 in 10-fold cross-validation on the benchmark dataset, and a sensitivity of 0.818 and an AUC score of 0.820 on the independent test dataset. The results show that our approach outperforms the existing ones.
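The sequential backward feature selection step named above can be illustrated with a short, generic Python sketch. It starts from all features and repeatedly drops the feature whose removal leaves the best-scoring subset. The feature names, the weights, and the `toy_score` function are all hypothetical. In a real setting the score would be something like cross-validated classifier performance, and this is not PreHots' actual code.

```python
def sequential_backward_selection(features, score, n_keep):
    """Greedy backward elimination.

    features: list of feature names
    score:    callable mapping a list of features to a number (higher is better)
    n_keep:   number of features to retain
    """
    selected = list(features)
    while len(selected) > n_keep:
        # Try removing each remaining feature in turn and keep the
        # reduced subset that scores best.
        candidates = ([f for f in selected if f != drop] for drop in selected)
        selected = max(candidates, key=score)
    return selected


# Hypothetical usefulness of each feature (assumed values for illustration).
toy_weights = {"feat_a": 0.50, "feat_b": 0.10, "feat_c": 0.30, "feat_d": 0.05}


def toy_score(subset):
    # Stand-in for a cross-validated performance estimate.
    return sum(toy_weights[f] for f in subset)


kept = sequential_backward_selection(list(toy_weights), toy_score, n_keep=2)
print(kept)
```

With this additive toy score, the procedure simply discards the two least useful features. With a real model-based score, features interact, so each elimination round has to re-evaluate every candidate subset.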

CONCLUSIONS : PreHots, which is based on a stacking ensemble of boosting algorithms, can reliably predict hot spots at the protein-DNA binding interface on a large scale. Compared with the existing methods, PreHots achieves better prediction performance. Both the webserver of PreHots and the datasets are freely available at: .

Pan Yuliang, Zhou Shuigeng, Guan Jihong


Ensemble stacking classifier, Feature selection, Hot spots, Protein-DNA complexes

General

Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis.

In BMC medical informatics and decision making ; h5-index 38.0

BACKGROUND : Breast cancer is the most prevalent and among the most deadly cancers in females. Patients with breast cancer have highly variable survival lengths, indicating a need to identify prognostic biomarkers for personalized diagnosis and treatment. With the development of new technologies such as next-generation sequencing, multi-omics data are becoming available for a more thorough evaluation of a patient's condition. In this study, we aim to improve breast cancer overall survival prediction by integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)).

METHODS : Motivated by multi-view learning, we propose a novel strategy to integrate multi-omics data for breast cancer survival prediction by applying the complementary and consensus principles. The complementary principle assumes that each omics modality contains modality-unique information. To preserve such information, we develop a concatenation autoencoder (ConcatAE) that concatenates the hidden features learned from each modality for integration. The consensus principle assumes that the disagreement among modalities upper-bounds the model error. To remove noise and discrepancies among modalities, we develop a cross-modality autoencoder (CrossAE) that maximizes the agreement among modalities to achieve a modality-invariant representation. We first validate the effectiveness of our proposed models on the MNIST simulated data. We then apply these models to the TCGA breast cancer multi-omics data for overall survival prediction.

RESULTS : For breast cancer overall survival prediction, the integration of DNA methylation and miRNA expression achieves the best overall performance of 0.641 ± 0.031 with ConcatAE, and 0.63 ± 0.081 with CrossAE. Both strategies outperform baseline single-modality models using only DNA methylation (0.583 ± 0.058) or miRNA expression (0.616 ± 0.057).

CONCLUSIONS : We achieve improved overall survival prediction performance by utilizing either the complementary or the consensus information among multi-omics data. The proposed ConcatAE and CrossAE models can inspire future deep representation-based multi-omics integration techniques. We believe these novel multi-omics integration models can benefit the personalized diagnosis and treatment of breast cancer patients.

Tong Li, Mitchel Jonathan, Chatlin Kevin, Wang May D


Breast Cancer, Deep learning, Multi-omics integration, Survival analysis

Public Health

Gradient boosting for Parkinson's disease diagnosis from voice recordings.

In BMC medical informatics and decision making ; h5-index 38.0

BACKGROUND : Parkinson's Disease (PD) is a clinically diagnosed neurodegenerative disorder that affects both motor and non-motor neural circuits. Speech deterioration (hypokinetic dysarthria) is a common symptom, which often presents early in the disease course. Machine learning can help movement disorders specialists improve their diagnostic accuracy using non-invasive and inexpensive voice recordings.

METHOD : We used the "Parkinson Dataset with Replicated Acoustic Features Data Set" from the UCI Machine Learning Repository. The dataset included 44 speech-test-based acoustic features from patients with PD and controls. We analyzed the data using various machine learning algorithms, including Light and Extreme Gradient Boosting, Random Forest, Support Vector Machines, K-nearest neighbors, Least Absolute Shrinkage and Selection Operator regression, and logistic regression. We also implemented a variable importance analysis to identify the variables most important for classifying patients with PD.

RESULTS : The cohort included a total of 80 subjects: 40 patients with PD (55% men) and 40 controls (67.5% men). Disease duration was 5 years or less for all subjects, with a mean Unified Parkinson's Disease Rating Scale (UPDRS) score of 19.6 (SD 8.1), and none were taking PD medication. The mean age for PD subjects and controls was 69.6 (SD 7.8) and 66.4 (SD 8.4), respectively. Our best-performing model used Light Gradient Boosting and achieved an AUC of 0.951 (95% confidence interval 0.946-0.955) in 4-fold cross-validation using only seven acoustic features.
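As a rough illustration of how a 4-fold cross-validation over a cohort of this shape might be set up, the Python sketch below splits 40 PD and 40 control subjects into four folds with equal class balance. Stratifying by class and splitting at the subject level are assumptions made for this sketch (the abstract does not describe the fold construction), and `stratified_folds` is a hypothetical helper.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, keeping class proportions
    roughly equal in every fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into folds.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


# 1 = PD, 0 = control, matching the cohort sizes reported above.
labels = [1] * 40 + [0] * 40
folds = stratified_folds(labels, k=4)

for fold_id, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    n_pd = sum(labels[i] for i in test_idx)
    print(f"fold {fold_id}: train={len(train_idx)} test={len(test_idx)} (PD in test: {n_pd})")
```

Each fold serves once as the held-out test set while the other three are used for training, and a metric such as AUC is then summarized across the four folds.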

CONCLUSIONS : Machine learning can accurately detect Parkinson's disease using an inexpensive and non-invasive voice recording. Light Gradient Boosting outperformed other machine learning algorithms. Such approaches could be used to inexpensively screen large patient populations for Parkinson's disease.

Karabayir Ibrahim, Goldman Samuel M, Pappu Suguna, Akbilgic Oguz


Artificial intelligence, Gradient boosting, Machine learning, Parkinson’s disease, Speech test

General

Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent.

In BMC bioinformatics

BACKGROUND : Statistical analyses of biological problems in the life sciences often lead to high-dimensional linear models. To solve the corresponding system of equations, penalization approaches are often the methods of choice. They are especially useful in the case of multicollinearity, which arises when the number of explanatory variables exceeds the number of observations or for some biological reason. The model's goodness of fit is then penalized by a suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined by backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths.
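The ingredients listed above (proximal gradient descent, backtracking line search, warm starts along a penalty grid) can be sketched for the plain lasso in pure Python. This is a didactic sketch, not seagull's implementation. It assumes the objective (1/(2n))·||y - Xb||² + lambda·||b||₁, covers neither the group nor the sparse-group penalty, and uses invented function names and toy data.

```python
def soft_threshold(v, thr):
    """Proximal operator of thr * |v| (the lasso shrinkage step)."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0


def predict(X, b):
    return [sum(xij * bj for xij, bj in zip(row, b)) for row in X]


def smooth_loss(X, y, b):
    """Least-squares part of the objective: ||y - Xb||^2 / (2n)."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predict(X, b))) / (2 * len(y))


def gradient(X, y, b):
    resid = [pi - yi for yi, pi in zip(y, predict(X, b))]
    return [sum(row[j] * r for row, r in zip(X, resid)) / len(y)
            for j in range(len(b))]


def lasso_prox_grad(X, y, lam, b0=None, step=1.0, shrink=0.5,
                    max_iter=500, tol=1e-10):
    b = list(b0) if b0 is not None else [0.0] * len(X[0])
    for _ in range(max_iter):
        g, f = gradient(X, y, b), smooth_loss(X, y, b)
        t = step
        while True:  # backtracking line search on the step size t
            new = [soft_threshold(bj - t * gj, t * lam) for bj, gj in zip(b, g)]
            diff = [nj - bj for nj, bj in zip(new, b)]
            bound = (f + sum(gj * dj for gj, dj in zip(g, diff))
                     + sum(d * d for d in diff) / (2 * t))
            if smooth_loss(X, y, new) <= bound + 1e-12:
                break
            t *= shrink
        b = new
        if max(abs(d) for d in diff) < tol:
            break
    return b


def lasso_path(X, y, lambdas):
    """Solve from the largest penalty down, warm-starting each fit
    from the previous solution."""
    path, b = [], None
    for lam in sorted(lambdas, reverse=True):
        b = lasso_prox_grad(X, y, lam, b0=b)
        path.append((lam, b))
    return path


# Toy data (assumed): orthogonal design with X^T X / n = I and y = X @ [2, 0.1].
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]
path = lasso_path(X, y, [0.05, 0.5])
for lam, coef in path:
    print(lam, [round(c, 4) for c in coef])
```

Because the toy design is orthogonal, the exact lasso solution is the soft-thresholded least-squares estimate, so the larger penalty zeroes out the weak second coefficient while the smaller one keeps both. Group and sparse-group variants replace `soft_threshold` with the corresponding group-wise proximal operator.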

RESULTS : Publicly available high-dimensional methylation data are used to compare seagull to the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. Although the results of seagull and SGL were very similar (R² > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature.

CONCLUSIONS : The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.

Klosa Jan, Simon Noah, Westermark Pål Olof, Liebscher Volkmar, Wittenburg Dörte


High-dimensional data, Machine learning, Optimization, R package