General General

Integration of human cell lines gene expression and chemical properties of drugs for Drug Induced Liver Injury prediction.

In Biology direct

MOTIVATION : Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI can significantly reduce the cost of clinical trials. In this work, we examined whether the occurrence of DILI can be predicted using gene expression profiles in cancer cell lines and the chemical properties of drugs.

METHODS : We used gene expression profiles from 13 human cell lines, as well as molecular properties of drugs, to build machine learning models of DILI. To this end, we used a robust cross-validated protocol based on feature selection and the Random Forest algorithm. In this protocol, we first identify the most informative variables and then use them to build predictive models. The models are first built separately from the data for each single cell line and from the chemical properties. They are then integrated using the Super Learner method, with several underlying methods for integration. The entire modelling process is performed within nested cross-validation.
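The protocol described above (per-block feature selection and Random Forest, integration of the block models, everything inside nested cross-validation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic, the block names and column ranges are hypothetical, and scikit-learn's `StackingClassifier` is used as a stand-in for the Super Learner, which is a related cross-validated stacking scheme and not necessarily what the authors implemented.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def block_model(columns, k=5, seed=0):
    """Feature selection followed by a Random Forest, restricted to one data block."""
    return Pipeline([
        # Keep only the columns that belong to this block (other columns are dropped).
        ("take_block", ColumnTransformer([("block", "passthrough", columns)])),
        # Selection sits inside the pipeline, so it is refit on every training fold.
        ("select", SelectKBest(f_classif, k=k)),
        ("forest", RandomForestClassifier(n_estimators=25, random_state=seed)),
    ])


def nested_cv_auc(X, y, blocks, seed=0):
    """Outer CV estimates AUC; inner CV feeds out-of-fold predictions to the combiner."""
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed + 1)
    stacked = StackingClassifier(
        estimators=[(name, block_model(cols, seed=seed)) for name, cols in blocks.items()],
        final_estimator=LogisticRegression(),
        cv=inner,
        stack_method="predict_proba",
    )
    return cross_val_score(stacked, X, y, cv=outer, scoring="roc_auc")


# Synthetic stand-in data: three hypothetical blocks of ten columns each.
X, y = make_classification(n_samples=150, n_features=30, n_informative=6, random_state=0)
blocks = {
    "cell_line_a": list(range(0, 10)),
    "cell_line_b": list(range(10, 20)),
    "descriptors": list(range(20, 30)),
}
scores = nested_cv_auc(X, y, blocks)
print(scores.mean())
```

Keeping the selection step inside each pipeline is the design choice that matters here: selecting variables on the full dataset before cross-validation would leak information from the test folds into the model.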

RESULTS : We obtained weakly predictive ML models when using either molecular descriptors or some individual cell lines (AUC in the range 0.55-0.61). Models obtained with the Super Learner approach have significantly improved accuracy (AUC = 0.73), which allows substances to be divided into two categories: low-risk and high-risk.

Lesiński Wojciech, Mnich Krzysztof, Golińska Agnieszka Kitlas, Rudnicki Witold R


Data integration, Machine learning, Random forest

General General

Explanation and prediction of clinical data with imbalanced class distribution based on pattern discovery and disentanglement.

In BMC medical informatics and decision making ; h5-index 38.0

BACKGROUND : Statistical data analysis, especially with advanced machine learning (ML) methods, has attracted considerable interest in clinical practice. We seek interpretability of the diagnostic/prognostic results, which will bring confidence to doctors, patients and their relatives in therapeutics and clinical practice. When datasets are imbalanced across diagnostic categories, ordinary ML methods may produce results that are overwhelmed by the majority classes, which diminishes prediction accuracy. Hence, methods are needed that can produce explicit, transparent and interpretable results for decision-making without sacrificing accuracy, even for data with imbalanced groups.
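The effect of a dominant majority class can be illustrated with a small worked example. The counts below are hypothetical and chosen only for illustration: a classifier that always predicts the majority class looks accurate overall while never detecting the minority class.

```python
def recall_for(label, y_true, y_pred):
    """Fraction of samples with the given true label that were predicted correctly."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for p in relevant if p == label) / len(relevant)


# Hypothetical imbalanced dataset: 95 majority (0) and 5 minority (1) cases.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
minority_recall = recall_for(1, y_true, y_pred)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_for(0, y_true, y_pred) + minority_recall) / 2

print(accuracy, minority_recall, balanced_accuracy)
```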

METHODS : To interpret clinical patterns and predict patients' diagnoses with high accuracy, we develop a novel method, Pattern Discovery and Disentanglement for Clinical Data Analysis (cPDD), which is able to discover patterns (correlated traits/indicants) and use them to classify clinical data even if the class distribution is imbalanced. In the most general setting, a relational dataset is a large table in which each column represents an attribute (trait/indicant) and each row contains a set of attribute values (AVs) of an entity (patient). Compared to existing pattern discovery approaches, cPDD can discover a small, succinct set of statistically significant high-order patterns from clinical data for interpreting and predicting the disease class of patients, even when groups are small and rare.

RESULTS : Experiments on synthetic and thoracic clinical datasets showed that cPDD can 1) discover a smaller set of succinct significant patterns compared to other existing pattern discovery methods; 2) allow users to interpret succinct sets of patterns coming from uncorrelated sources, even if the groups are rare/small; and 3) obtain better prediction performance compared to other interpretable classification approaches.

CONCLUSIONS : cPDD discovers fewer patterns with more comprehensive coverage, which improves the interpretability of the patterns discovered. Experimental results on synthetic data validated that cPDD discovers all patterns implanted in the data and displays them precisely and succinctly with statistical support for interpretation and prediction, a capability which traditional ML methods lack. The success of cPDD as a novel interpretable method in solving the imbalanced class problem shows its great potential for clinical data analysis for years to come.

Zhou Pei-Yuan, Wong Andrew K C


Clinical decision-making, Disentanglement, Imbalance classification, Pattern discovery

General General

Computational Barthel Index: an automated tool for assessing and predicting activities of daily living among nursing home patients.

In BMC medical informatics and decision making ; h5-index 38.0

BACKGROUND : Assessment of functional ability, including activities of daily living (ADLs), is a manual process completed by skilled health professionals. In the presented research, an automated decision support tool, the Computational Barthel Index Tool (CBIT), was constructed that can automatically assess and predict probabilities of current and future ADLs based on patients' medical history.

METHODS : The data used to construct the tool include the demographic information, inpatient and outpatient diagnosis codes, and reported disabilities of 181,213 residents of the Department of Veterans Affairs' (VA) Community Living Centers. Supervised machine learning methods were applied to construct the CBIT. Temporal information about times from the first and the most recent occurrence of diagnoses was encoded. Ten-fold cross-validation was used to tune hyperparameters, and independent test sets were used to evaluate models using AUC, accuracy, recall and precision. Random forest achieved the best model quality. Models were calibrated using isotonic regression.
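The evaluation steps named above (ten-fold cross-validation for tuning, an independent test set, a random forest, isotonic calibration) might be sketched as follows. This is a minimal illustration on synthetic data, not the CBIT pipeline: the features, the hyperparameter grid, the 0.5 decision threshold and the split sizes are all assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for patient characteristics and a binary ADL outcome.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Ten-fold cross-validation on the training data to tune a (hypothetical) grid.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=10,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Refit with the selected setting and calibrate probabilities with isotonic regression.
forest = RandomForestClassifier(n_estimators=30, random_state=0, **search.best_params_)
calibrated = CalibratedClassifierCV(forest, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Evaluate once on the held-out test set.
proba = calibrated.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
metrics = {
    "auc": roc_auc_score(y_test, proba),
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
}
print(metrics)
```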

RESULTS : The unabridged version of CBIT uses 578 patient characteristics and achieved average AUC of 0.94 (0.93-0.95), accuracy of 0.90 (0.89-0.91), precision of 0.91 (0.89-0.92), and recall of 0.90 (0.84-0.95) when re-evaluating patients. CBIT is also capable of predicting ADLs up to one year ahead, with accuracy decreasing over time, giving average AUC of 0.77 (0.73-0.79), accuracy of 0.73 (0.69-0.80), precision of 0.74 (0.66-0.81), and recall of 0.69 (0.34-0.96). A simplified version of CBIT with 50 top patient characteristics reached performance that does not significantly differ from full CBIT.

CONCLUSION : Discharge planners, disability application reviewers and clinicians evaluating comparative effectiveness of treatments can use CBIT to assess and predict information on functional status of patients.

Wojtusiak Janusz, Asadzadehzanjani Negin, Levy Cari, Alemi Farrokh, Williams Allison E


Activities of daily living, Gerontology, Machine learning, Supervised learning

General General

Data-driven estimates of global litter production imply slower vegetation carbon turnover.

In Global change biology

Accurate quantification of vegetation carbon turnover time (τveg) is critical for reducing uncertainties in terrestrial vegetation response to future climate change. However, in the absence of global information on litter production, τveg could only be estimated based on net primary productivity under the steady-state assumption. Here, we applied a machine-learning approach to derive a global dataset of litter production by linking 2401 field observations and global environmental drivers. Results suggested that the observation-based estimate of global natural ecosystem litter production was 44.3 ± 0.4 Pg C y⁻¹. By contrast, land-surface models (LSMs) overestimated the global litter production by about 27%. With this new global litter production dataset, we estimated global τveg (mean value 10.3 ± 1.4 y) and its spatial distribution. Compared to our observation-based estimate, modelled τveg tended to be underestimated at high latitudes. Our empirically derived gridded datasets of litter production and τveg will help constrain global vegetation models and improve the prediction of the global carbon cycle.
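The link between litter production and turnover time can be made concrete with a small calculation. The sketch below assumes the common definition of turnover time as carbon stock divided by the outgoing flux, and it uses a hypothetical vegetation carbon stock (450 Pg C) that does not come from the abstract. Only the 44.3 Pg C y⁻¹ flux and the 27% overestimate are taken from the text.

```python
def turnover_time(carbon_stock_pg, litter_flux_pg_per_year):
    """Turnover time in years, assuming tau = stock / outgoing flux."""
    if litter_flux_pg_per_year <= 0:
        raise ValueError("litter flux must be positive")
    return carbon_stock_pg / litter_flux_pg_per_year


observed_flux = 44.3                 # Pg C per year, from the abstract
model_flux = observed_flux * 1.27    # models overestimate by about 27%
hypothetical_stock = 450.0           # Pg C, illustrative value only

tau_observed = turnover_time(hypothetical_stock, observed_flux)
tau_model = turnover_time(hypothetical_stock, model_flux)

# For the same stock, a 27% larger flux shortens tau by a factor of 1 / 1.27.
ratio = tau_model / tau_observed
print(round(ratio, 3))
```

Under these assumptions, a larger modelled flux gives a shorter modelled turnover time for the same stock, which is the sense in which a lower observation-based flux implies slower turnover.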

He Yue, Wang Xuhui, Wang Kai, Tang Shuchang, Xu Hao, Chen Anping, Ciais Philippe, Li Xiangyi, Peñuelas Josep, Piao Shilong


boosted regression trees, land-surface models, litter production, vegetation carbon stock, vegetation carbon turnover time

General General

Satellite data and machine learning reveal the incidence of late frost defoliations on Iberian beech forests.

In Ecological applications : a publication of the Ecological Society of America

Climate warming is driving an advance of leaf unfolding date in temperate deciduous forests, promoting longer growing seasons and higher carbon gains. However, an earlier leaf phenology also increases the risk of late frost defoliation (LFD) events. Compiling the spatio-temporal patterns of defoliations caused by spring frost events is critical to reveal whether the balance between the current advance in leaf unfolding dates and the frequency of LFD occurrence is changing and represents a threat to the future viability and persistence of deciduous forests. We combined satellite imagery with machine learning techniques to reconstruct the spatio-temporal patterns of LFD events for the 2003-2018 period in the Iberian range of European beech (Fagus sylvatica), at the drier distribution edge of the species. We used MODIS Vegetation Index Products to generate a Normalized Difference Vegetation Index (NDVI) time series for each 250 x 250 m pixel in a total area of 1,013 km² (16,218 pixels). A semi-supervised approach was used to train a machine learning model, in which a binary classifier, a Support Vector Machine with Global Alignment Kernel, was used to differentiate between late frost and non-late frost pixels. We verified the obtained estimates with photointerpretation and existing beech tree-ring chronologies to iteratively improve the model. Then, we used the model output to identify topographical and climatic factors that determined the spatial incidence of LFD. During the study period, LFD was a low-recurrence phenomenon that occurred every 15.2 years on average and showed high spatio-temporal heterogeneity. Most LFD events were concentrated in five years and clustered in western forests (86.5% in one fifth of the pixels) located at high elevation with lower-than-average precipitation. Elevation and longitude were the major LFD risk factors, followed by annual precipitation. The synergistic effects of increasing drought intensity and rising temperature combined with more frequent late frost events may determine the future performance and distribution of beech forests. This interaction might be critical at the beech drier range edge, where the concentration of LFD at high elevations could constrain beech altitudinal shifts and/or favor species with higher resistance to late frosts.
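For readers unfamiliar with the index, NDVI is conventionally defined from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The sketch below shows that standard definition and checks the reported study area against the pixel count. The reflectance values are hypothetical, and the MODIS products mentioned in the abstract may supply NDVI directly rather than requiring this calculation.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total


# Hypothetical reflectances: dense canopy versus a defoliated canopy.
healthy = ndvi(0.5, 0.1)
defoliated = ndvi(0.3, 0.2)

# Consistency check on the reported study area: 16,218 pixels of 250 m x 250 m.
pixel_area_km2 = 0.25 * 0.25
total_area_km2 = 16218 * pixel_area_km2

print(healthy, defoliated, total_area_km2)
```

The area check gives 1,013.625 km², consistent with the 1,013 km² stated in the abstract.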

Olano José M, García-Cervigón Ana I, Sangüesa-Barreda Gabriel, Rozas Vicente, Muñoz-Garachana Diego, García-Hidalgo Miguel, García-Pedrero Ángel


Fagus sylvatica, Climate warming, MODIS, Normalized Difference Vegetation Index (NDVI), extreme climate events, late spring frost defoliation

General General

Screening of sleep apnea based on heart rate variability and long short-term memory.

In Sleep & breathing = Schlaf & Atmung

PURPOSE : Sleep apnea syndrome (SAS) is a prevalent sleep disorder in which apnea and hypopnea occur frequently during sleep, increasing the risk of developing lifestyle-related diseases and causing daytime sleepiness. Although SAS is a common sleep disorder, most patients remain undiagnosed because the gold-standard test, polysomnography (PSG), is costly and unavailable in many hospitals. Thus, an SAS screening system that can be used easily at home is needed.

METHODS : Apnea during sleep affects autonomic nervous function, which causes fluctuation of the heart rate. In this study, we propose a new SAS screening method that combines heart rate measurement and long short-term memory (LSTM), which is a type of recurrent neural network (RNN). We analyzed the intervals between adjacent R waves (R-R interval; RRI) on electrocardiogram (ECG) records, and trained an LSTM model, whose inputs are the RRI data, to discriminate the respiratory condition during sleep.
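The input preparation described above can be sketched as follows: R-R intervals are the differences between consecutive R-wave times, and a recurrent model such as an LSTM typically consumes them as fixed-length sequences. The window length and step below are hypothetical, as are the R-wave times. The abstract does not state how the sequences were segmented.

```python
def rr_intervals(r_peak_times_s):
    """R-R intervals in milliseconds from R-wave times given in seconds."""
    return [(b - a) * 1000.0 for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]


def sliding_windows(values, length, step):
    """Fixed-length, possibly overlapping segments to feed a sequence model."""
    return [values[i:i + length] for i in range(0, len(values) - length + 1, step)]


# Hypothetical R-wave times (seconds) detected on an ECG record.
r_peaks = [0.0, 1.0, 2.5, 3.5, 4.0]
rri = rr_intervals(r_peaks)
segments = sliding_windows(rri, length=3, step=1)

print(rri)
print(segments)
```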

RESULTS : Applied to clinical data, the proposed method identified patients with moderate-to-severe SAS with a sensitivity of 100% and a specificity of 100%, results that are superior to those of other existing SAS screening methods.
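Sensitivity and specificity follow from the counts of correct and incorrect screening decisions: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The labels in this sketch are hypothetical and are not the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical screening outcome: 3 patients with SAS, 4 without.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A result of 100% for both measures, as reported, corresponds to zero false negatives and zero false positives on the evaluated patients.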

CONCLUSION : Since the RRI data can be easily measured by means of wearable heart rate sensors, our method may prove to be useful as an SAS screening system at home.

Iwasaki Ayako, Nakayama Chikao, Fujiwara Koichi, Sumi Yukiyoshi, Matsuo Masahiro, Kano Manabu, Kadotani Hiroshi


Machine learning, Sleep apnea syndrome, Telemedicine, Wearable sensor