
General

A Statistical Analysis of Risk Factors and Biological Behavior in Canine Mammary Tumors: A Multicenter Study.

In Animals : an open access journal from MDPI

Canine mammary tumors (CMTs) represent a serious issue in worldwide veterinary practice and several risk factors are variably implicated in the biology of CMTs. The present study examines the relationship between risk factors and histological diagnosis of a large CMT dataset from three academic institutions by classical statistical analysis and supervised machine learning methods. Epidemiological, clinical, and histopathological data of 1866 CMTs were included. Dogs with malignant tumors were significantly older than dogs with benign tumors (9.6 versus 8.7 years, P < 0.001). Malignant tumors were significantly larger than benign counterparts (2.69 versus 1.7 cm, P < 0.001). Interestingly, 18% of malignant tumors were smaller than 1 cm in diameter, providing compelling evidence that the size of the tumor should be reconsidered during the assessment of the TNM-WHO clinical staging. The application of the logistic regression and the machine learning model identified the age and the tumor's size as the best predictors with an overall diagnostic accuracy of 0.63, suggesting that these risk factors are sufficient but not exhaustive indicators of the malignancy of CMTs. This multicenter study increases the general knowledge of the main epidemiological-clinical risk factors involved in the onset of CMTs and paves the way for further investigations of these factors in association with CMTs and in the application of machine learning technology.

Burrai Giovanni P, Gabrieli Andrea, Moccia Valentina, Zappulli Valentina, Porcellato Ilaria, Brachelente Chiara, Pirino Salvatore, Polinas Marta, Antuofermo Elisabetta


age, breed, dogs, machine learning, mammary tumor size, reproductive and hormonal status

General

Data-Driven Molecular Dynamics: A Multifaceted Challenge.

In Pharmaceuticals (Basel, Switzerland)

The big data concept is currently revolutionizing several fields of science including drug discovery and development. While opening up new perspectives for better drug design and related strategies, big data analysis strongly challenges our current ability to manage and exploit an extraordinarily large and possibly diverse amount of information. The recent renewal of machine learning (ML)-based algorithms is key in providing the proper framework for addressing this issue. In this respect, the impact on the exploitation of molecular dynamics (MD) simulations, which have recently reached mainstream status in computational drug discovery, can be remarkable. Here, we review the recent progress in the use of ML methods coupled to biomolecular simulations with potentially relevant implications for drug design. Specifically, we show how different ML-based strategies can be applied to the outcome of MD simulations for gaining knowledge and enhancing sampling. Finally, we discuss how intrinsic limitations of MD in accurately modeling biomolecular systems can be alleviated by including information coming from experimental data.

Bernetti Mattia, Bertazzo Martina, Masetti Matteo


Markov state models, collective variables, dimensionality reduction, experimental data, machine learning, maximum entropy principle, reaction coordinates

Internal Medicine

Employing computational linguistics techniques to identify limited patient health literacy: Findings from the ECLIPPSE study.

In Health services research

OBJECTIVE : To develop novel, scalable, and valid literacy profiles for identifying limited health literacy patients by harnessing natural language processing.

DATA SOURCE : With respect to the linguistic content, we analyzed 283 216 secure messages sent by 6941 diabetes patients to physicians within an integrated system's electronic portal. Sociodemographic, clinical, and utilization data were obtained via questionnaire and electronic health records.

STUDY DESIGN : Retrospective study used natural language processing and machine learning to generate five unique "Literacy Profiles" by employing various sets of linguistic indices: Flesch-Kincaid (LP_FK); basic indices of writing complexity, including lexical diversity (LP_LD) and writing quality (LP_WQ); and advanced indices related to syntactic complexity, lexical sophistication, and diversity, modeled from self-reported (LP_SR), and expert-rated (LP_Exp) health literacy. We first determined the performance of each literacy profile relative to self-reported and expert-rated health literacy to discriminate between high and low health literacy and then assessed Literacy Profiles' relationships with known correlates of health literacy, such as patient sociodemographics and a range of health-related outcomes, including ratings of physician communication, medication adherence, diabetes control, comorbidities, and utilization.
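As a rough illustration of the simplest index named above, the Flesch-Kincaid grade level can be computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is not the ECLIPPSE pipeline: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic assumed here only to keep the example self-contained.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels (heuristic)."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent (but not a trailing 'le') when other vowel runs exist.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a text, using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    simple = "I feel sick. My leg hurts."
    complex_ = ("My glucose readings have remained persistently elevated "
                "despite medication adherence.")
    print(flesch_kincaid_grade(simple), flesch_kincaid_grade(complex_))
```

Short messages with short words score low and long, polysyllabic sentences score high, which is why a single readability formula is only the most basic of the five profiles compared in the study.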

PRINCIPAL FINDINGS : LP_SR and LP_Exp performed best in discriminating between high and low self-reported (C-statistics: 0.86 and 0.58, respectively) and expert-rated health literacy (C-statistics: 0.71 and 0.87, respectively) and were significantly associated with educational attainment, race/ethnicity, Consumer Assessment of Healthcare Providers and Systems (CAHPS) scores, adherence, glycemia, comorbidities, and emergency department visits.
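For readers unfamiliar with the C-statistic reported here: for a binary outcome it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch of that definition, assuming hypothetical score and label lists; it does not reproduce the study's models or data.

```python
def c_statistic(scores, labels):
    """Concordance (C) statistic for binary labels (1 = positive, 0 = negative)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0   # positive case ranked above negative case
            elif p == n:
                concordant += 0.5   # ties count as half
    return concordant / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Illustrative values only: 3 of the 4 positive/negative pairs are ordered correctly.
    print(c_statistic([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which puts the reported 0.58 and 0.87 in context.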

CONCLUSIONS : Since health literacy is a potentially remediable explanatory factor in health care disparities, the development of automated health literacy indicators represents a significant accomplishment with broad clinical and population health applications. Health systems could apply literacy profiles to efficiently determine whether quality of care and outcomes vary by patient health literacy; identify at-risk populations for targeting tailored health communications and self-management support interventions; and inform clinicians to promote improvements in individual-level care.

Schillinger Dean, Balyan Renu, Crossley Scott A, McNamara Danielle S, Liu Jennifer Y, Karter Andrew J


communication, diabetes, health literacy, machine learning, managed care, natural language processing, secure messaging

General

Knockoff Boosted Tree for Model-Free Variable Selection.

In Bioinformatics (Oxford, England)

MOTIVATION : The recently proposed knockoff filter is a general framework for controlling the false discovery rate when performing variable selection. This powerful new approach generates a "knockoff" of each variable tested for exact false discovery rate control. Imitation variables that mimic the correlation structure found within the original variables serve as negative controls for statistical inference. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables existing in model functions. Here, we extend the use of knockoffs for machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of model function is required. However, currently available importance scores in tree models are insufficient for variable selection with false discovery rate control.
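The selection step of the knockoff filter can be sketched as follows. This assumes each variable j already has a statistic W_j (for example, the importance of the original variable minus the importance of its knockoff, which is one common choice and an assumption here), so that large positive values favor the original. The code applies the "knockoff+" threshold from the original knockoff filter; the function names are hypothetical and this is not the KOBT package's interface.

```python
def knockoff_threshold(w, q, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the 'knockoff+' rule; returns infinity if no t qualifies.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # knockoffs that beat their originals
        positives = sum(1 for x in w if x >= t)   # originals that beat their knockoffs
        if (offset + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")


def knockoff_select(w, q, offset=1):
    """Indices of variables whose statistic reaches the data-dependent threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Illustrative statistics only, not taken from the paper.
    w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, -0.2, 0.1, 2.0]
    print(knockoff_threshold(w, q=0.2), knockoff_select(w, q=0.2))
```

The negative statistics act as the negative controls described above: their count at each candidate threshold estimates how many of the positive statistics are false discoveries.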

RESULTS : We propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. We extend the current knockoff method to model-free variable selection through the use of tree-based models. Additionally, we propose and evaluate two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test and compare these methods with the original knockoff method regarding their ability to control type I errors and power. In simulation tests, we compare the properties and performance of importance test statistics of tree models. The results include different combinations of knockoffs and importance test statistics. We consider scenarios that include main-effect, interaction, exponential, and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using Cancer Genome Atlas (TCGA) gene expression data. Our results show improved discrimination between difficult-to-discriminate cancer types.

AVAILABILITY AND IMPLEMENTATION : The proposed algorithm is included in the KOBT package, which is available at

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Jiang Tao, Li Yuanyuan, Motsinger-Reif Alison A


General

Establishing the accuracy of density functional approaches for the description of noncovalent interactions in biomolecules.

In Physical chemistry chemical physics : PCCP

Biomolecules have complex structures, and noncovalent interactions are crucial to determine their conformations and functionalities. It is therefore critical to be able to describe them in an accurate but efficient manner in these systems. In this context density functional theory (DFT) could provide a powerful tool to simulate biological matter either directly for relatively simple systems or coupled with classical simulations like the QM/MM (quantum mechanics/molecular mechanics) approach. Additionally, DFT could play a fundamental role to fit the parameters of classical force fields or to train machine learning potentials to perform large scale molecular dynamics simulations of biological systems. Yet, local or semi-local approximations used in DFT cannot describe van der Waals (vdW) interactions, one of the essential noncovalent interactions in biomolecules, since they lack a proper description of long range correlation effects. However, many efficient and reasonably accurate methods are now available for the description of van der Waals interactions within DFT. In this work, we establish the accuracy of several state-of-the-art vdW-aware functionals by considering 275 biomolecules including interacting DNA and RNA bases, peptides and biological inhibitors and compare our results for the energy with highly accurate wavefunction based calculations. Most methods considered here can achieve close to predictive accuracy. In particular, the non-local vdW-DF2 functional is revealed to be the best performer for biomolecules, while among the vdW-corrected DFT methods, uMBD is also recommended as a less accurate but faster alternative.

Kim Minho, Gould Tim, Rocca Dario, Lebègue Sébastien


Radiology

A Quality Control System for Automated Prostate Segmentation on T2-Weighted MRI.

In Diagnostics (Basel, Switzerland)

Computer-aided detection and diagnosis (CAD) systems have the potential to improve robustness and efficiency compared to traditional radiological reading of magnetic resonance imaging (MRI). Fully automated segmentation of the prostate is a crucial step of CAD for prostate cancer, but visual inspection is still required to detect poorly segmented cases. The aim of this work was therefore to establish a fully automated quality control (QC) system for prostate segmentation based on T2-weighted MRI. Four different deep learning-based segmentation methods were used to segment the prostate for 585 patients. First order, shape and textural radiomics features were extracted from the segmented prostate masks. A reference quality score (QS) was calculated for each automated segmentation in comparison to a manual segmentation. A least absolute shrinkage and selection operator (LASSO) was trained and optimized on a randomly assigned training dataset (N = 1756, 439 cases from each segmentation method) to build a generalizable linear regression model based on the radiomics features that best estimated the reference QS. Subsequently, the model was used to estimate the QSs for an independent testing dataset (N = 584, 146 cases from each segmentation method). The mean ± standard deviation absolute error between the estimated and reference QSs was 5.47 ± 6.33 on a scale from 0 to 100. In addition, we found a strong correlation between the estimated and reference QSs (rho = 0.70). In conclusion, we developed an automated QC system that may be helpful for evaluating the quality of automated prostate segmentations.
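The two evaluation figures quoted above (mean ± standard deviation of the absolute error, and a correlation coefficient rho) can be computed as in the sketch below. It assumes that rho denotes Spearman's rank correlation, which the abstract does not state explicitly, and it uses hypothetical helper names and made-up scores rather than the study's data.

```python
from statistics import mean, stdev


def abs_error_summary(estimated, reference):
    """Mean and sample standard deviation of |estimated - reference|."""
    errors = [abs(e - r) for e, r in zip(estimated, reference)]
    return mean(errors), stdev(errors)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    estimated = [80.0, 90.0, 70.0, 55.0]   # hypothetical estimated quality scores
    reference = [85.0, 90.0, 60.0, 50.0]   # hypothetical reference quality scores
    print(abs_error_summary(estimated, reference), spearman_rho(estimated, reference))
```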

Sunoqrot Mohammed R S, Selnæs Kirsten M, Sandsmark Elise, Nketiah Gabriel A, Zavala-Romero Olmo, Stoyanova Radka, Bathen Tone F, Elschot Mattijs


MRI, computer-aided detection and diagnosis, deep learning, machine learning, prostate, quality control, radiomics, segmentation