
General

Review: Application and Prospective Discussion of Machine Learning for the Management of Dairy Farms.

In Animals : an open access journal from MDPI

Dairy farmers use herd management systems, behavioral sensors, feeding lists, breeding schedules, and health records to document herd characteristics. Consequently, large amounts of dairy data are becoming available. However, a lack of data integration makes it difficult for farmers to analyze the data on their dairy farm, which indicates that these data are currently not being used to their full potential. Hence, multiple issues in dairy farming such as low longevity, poor performance, and health issues remain. We aimed to evaluate whether machine learning (ML) methods can solve some of these existing issues in dairy farming. This review summarizes peer-reviewed ML papers published in the dairy sector between 2015 and 2020. Ultimately, 97 papers from the subdomains of management, physiology, reproduction, behavior analysis, and feeding were considered in this review. The results confirm that ML algorithms have become common tools in most areas of dairy research, particularly for prediction. Despite the quantity of research available, most tested algorithms have not performed well enough for reliable implementation in practice. This may be due to poor training data. The availability of data resources from multiple farms covering longer periods would be useful to improve prediction accuracies. In conclusion, ML is a promising tool in dairy research, which could be used to develop and improve decision support for farmers. As the cow is a multifactorial system, ML algorithms could analyze integrated data sources that describe cows and ultimately allow them to be managed according to all relevant influencing factors. However, both the integration of multiple data sources and the availability of public data currently remain challenging.

Cockburn Marianne


big data, cluster, data analysis, data integration, sensor, smart farming

Public Health

Environmental Health Surveillance System for a Population Using Advanced Exposure Assessment.

In Toxics

Human exposure to air pollution is a major public health concern. Environmental policymakers have been implementing various strategies to reduce exposure, including the 10th-day-no-driving system. To assess exposure of an entire population of a community in a highly polluted area, pollutant concentrations in microenvironments and population time-activity patterns are required. To date, population exposure to air pollutants has been assessed using air monitoring data from fixed atmospheric monitoring stations, atmospheric dispersion modeling, or spatial interpolation techniques for pollutant concentrations. This is coupled with census data, administrative registers, and data on the patterns of the time-based activities at the individual scale. Recent technologies such as sensors, the Internet of Things (IoT), communications technology, and artificial intelligence enable the accurate evaluation of air pollution exposure for a population in an environmental health context. In this study, the latest trends in published papers on the assessment of population exposure to air pollution were reviewed. Subsequently, this study proposes a methodology that will enable policymakers to develop an environmental health surveillance system that evaluates the distribution of air pollution exposure for a population within a target area and establish countermeasures based on advanced exposure assessment.

Yang Wonho, Park Jinhyeon, Cho Mansu, Lee Cheolmin, Lee Jeongil, Lee Chaekwan


air pollution, environmental health surveillance system, exposure assessment, population exposure

General

A Statistical Analysis of Risk Factors and Biological Behavior in Canine Mammary Tumors: A Multicenter Study.

In Animals : an open access journal from MDPI

Canine mammary tumors (CMTs) represent a serious issue in worldwide veterinary practice and several risk factors are variably implicated in the biology of CMTs. The present study examines the relationship between risk factors and histological diagnosis of a large CMT dataset from three academic institutions by classical statistical analysis and supervised machine learning methods. Epidemiological, clinical, and histopathological data of 1866 CMTs were included. Dogs with malignant tumors were significantly older than dogs with benign tumors (9.6 versus 8.7 years, P < 0.001). Malignant tumors were significantly larger than benign counterparts (2.69 versus 1.7 cm, P < 0.001). Interestingly, 18% of malignant tumors were smaller than 1 cm in diameter, providing compelling evidence that the size of the tumor should be reconsidered during the assessment of the TNM-WHO clinical staging. Logistic regression and the machine learning model identified age and tumor size as the best predictors, with an overall diagnostic accuracy of 0.63, suggesting that these risk factors are sufficient but not exhaustive indicators of the malignancy of CMTs. This multicenter study increases the general knowledge of the main epidemiological and clinical risk factors involved in the onset of CMTs and paves the way for further investigations of these factors in association with CMTs and in the application of machine learning technology.

Burrai Giovanni P, Gabrieli Andrea, Moccia Valentina, Zappulli Valentina, Porcellato Ilaria, Brachelente Chiara, Pirino Salvatore, Polinas Marta, Antuofermo Elisabetta


age, breed, dogs, machine learning, mammary tumor size, reproductive and hormonal status

General

Data-Driven Molecular Dynamics: A Multifaceted Challenge.

In Pharmaceuticals (Basel, Switzerland)

The big data concept is currently revolutionizing several fields of science including drug discovery and development. While opening up new perspectives for better drug design and related strategies, big data analysis strongly challenges our current ability to manage and exploit an extraordinarily large and possibly diverse amount of information. The recent renewal of machine learning (ML)-based algorithms is key in providing the proper framework for addressing this issue. In this respect, the impact on the exploitation of molecular dynamics (MD) simulations, which have recently reached mainstream status in computational drug discovery, can be remarkable. Here, we review the recent progress in the use of ML methods coupled to biomolecular simulations with potentially relevant implications for drug design. Specifically, we show how different ML-based strategies can be applied to the outcome of MD simulations for gaining knowledge and enhancing sampling. Finally, we discuss how intrinsic limitations of MD in accurately modeling biomolecular systems can be alleviated by including information coming from experimental data.

Bernetti Mattia, Bertazzo Martina, Masetti Matteo


Markov state models, collective variables, dimensionality reduction, experimental data, machine learning, maximum entropy principle, reaction coordinates

Internal Medicine

Employing computational linguistics techniques to identify limited patient health literacy: Findings from the ECLIPPSE study.

In Health services research

OBJECTIVE : To develop novel, scalable, and valid literacy profiles for identifying limited health literacy patients by harnessing natural language processing.

DATA SOURCE : With respect to the linguistic content, we analyzed 283 216 secure messages sent by 6941 diabetes patients to physicians within an integrated system's electronic portal. Sociodemographic, clinical, and utilization data were obtained via questionnaire and electronic health records.

STUDY DESIGN : Retrospective study used natural language processing and machine learning to generate five unique "Literacy Profiles" by employing various sets of linguistic indices: Flesch-Kincaid (LP_FK); basic indices of writing complexity, including lexical diversity (LP_LD) and writing quality (LP_WQ); and advanced indices related to syntactic complexity, lexical sophistication, and diversity, modeled from self-reported (LP_SR), and expert-rated (LP_Exp) health literacy. We first determined the performance of each literacy profile relative to self-reported and expert-rated health literacy to discriminate between high and low health literacy and then assessed Literacy Profiles' relationships with known correlates of health literacy, such as patient sociodemographics and a range of health-related outcomes, including ratings of physician communication, medication adherence, diabetes control, comorbidities, and utilization.
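For readers unfamiliar with the simplest of these indices, the Flesch-Kincaid grade level can be sketched as follows. This is only an illustration of the standard published formula, not the study's implementation; the function names and the vowel-run syllable counter are assumptions made here, and the syllable heuristic is deliberately crude.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, with at least one syllable per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A word is a run of letters, optionally followed by an apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Longer sentences and longer words both raise the score, which is why the index serves only as a baseline for writing complexity next to the richer lexical and syntactic indices.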

PRINCIPAL FINDINGS : LP_SR and LP_Exp performed best in discriminating between high and low self-reported (C-statistics: 0.86 and 0.58, respectively) and expert-rated health literacy (C-statistics: 0.71 and 0.87, respectively) and were significantly associated with educational attainment, race/ethnicity, Consumer Assessment of Healthcare Providers and Systems (CAHPS) scores, adherence, glycemia, comorbidities, and emergency department visits.
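The C-statistic reported here is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that definition follows; the function name and the convention that label 1 marks the positive class are assumptions for illustration, not details taken from the study.

```python
from typing import Sequence


def c_statistic(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Concordance (C) statistic for binary labels.

    labels: 1 for the positive class, 0 for the negative class (assumed coding).
    scores: model scores, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    # Compare every positive case with every negative case.
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0   # correctly ordered pair
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which puts the reported range of 0.58 to 0.87 in context.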

CONCLUSIONS : Since health literacy is a potentially remediable explanatory factor in health care disparities, the development of automated health literacy indicators represents a significant accomplishment with broad clinical and population health applications. Health systems could apply literacy profiles to efficiently determine whether quality of care and outcomes vary by patient health literacy; identify at-risk populations for targeting tailored health communications and self-management support interventions; and inform clinicians to promote improvements in individual-level care.

Schillinger Dean, Balyan Renu, Crossley Scott A, McNamara Danielle S, Liu Jennifer Y, Karter Andrew J


communication, diabetes, health literacy, machine learning, managed care, natural language processing, secure messaging

General

Knockoff Boosted Tree for Model-Free Variable Selection.

In Bioinformatics (Oxford, England)

MOTIVATION : The recently proposed knockoff filter is a general framework for controlling the false discovery rate when performing variable selection. This powerful new approach generates a "knockoff" of each variable tested for exact false discovery rate control. Imitation variables that mimic the correlation structure found within the original variables serve as negative controls for statistical inference. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables existing in model functions. Here, we extend the use of knockoffs for machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of model function is required. However, currently available importance scores in tree models are insufficient for variable selection with false discovery rate control.
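The selection step of the knockoff filter can be sketched as below. This follows the generic knockoff and knockoff+ threshold rule from the knockoff literature and is not the KOBT package's code; the function names and the input `w` (one importance statistic per variable, positive when the original variable outscored its knockoff) are assumptions for illustration. How `w` is computed from boosted trees is the paper's contribution and is not reproduced here.

```python
from typing import List, Sequence


def knockoff_threshold(w: Sequence[float], q: float, offset: int = 1) -> float:
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    w: per-variable statistics; a large positive value means the original
       variable looked more important than its knockoff copy.
    q: target false discovery rate.
    offset: 1 gives the knockoff+ rule, 0 the less conservative variant.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of w.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        # Knockoffs that beat their originals act as negative controls.
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target: select nothing


def select_variables(w: Sequence[float], q: float, offset: int = 1) -> List[int]:
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]
```

For example, `select_variables([5.0, 4.0, 3.5, 3.0, -0.5, 0.2, 2.5, -0.1, 1.5, 0.3], q=0.2)` keeps the six variables with statistics of at least 1.5, while a stricter `q=0.1` selects none.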

RESULTS : We propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. We extend the current knockoff method to model-free variable selection through the use of tree-based models. Additionally, we propose and evaluate two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test and compare these methods with the original knockoff method regarding their ability to control type I errors and power. In simulation tests, we compare the properties and performance of importance test statistics of tree models. The results include different combinations of knockoffs and importance test statistics. We consider scenarios that include main-effect, interaction, exponential, and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using Cancer Genome Atlas (TCGA) gene expression data. Our results show improved discrimination between difficult-to-discriminate cancer types.

AVAILABILITY AND IMPLEMENTATION : The proposed algorithm is included in the KOBT package, which is available at

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Jiang Tao, Li Yuanyuan, Motsinger-Reif Alison A