Public Health


In The annals of applied statistics

Bayesian Additive Regression Trees (BART) is a flexible machine learning algorithm capable of capturing nonlinearities between an outcome and covariates and interactions among covariates. We extend BART to a semiparametric regression framework in which the conditional expectation of an outcome is a function of treatment, its effect modifiers, and confounders. The confounders are allowed to have unspecified functional form, while treatment and effect modifiers that are directly related to the research question are given a linear form. The result is a Bayesian semiparametric linear regression model where the posterior distribution of the parameters of the linear part can be interpreted as in parametric Bayesian regression. This is useful in situations where a subset of the variables is of substantive interest and the others are nuisance variables that we would like to control for. An example of this occurs in causal modeling with the structural mean model (SMM). Under certain causal assumptions, our method can be used as a Bayesian SMM. Our methods are demonstrated with simulation studies and an application to a dataset involving adults with HIV/Hepatitis C coinfection who newly initiate antiretroviral therapy. The methods are available in an R package called semibart.

Zeldow Bret, Lo Re Vincent, Roy Jason


Bayesian Additive Regression Trees, antiretrovirals, structural mean model
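
The model structure described in the abstract above can be written compactly. The notation here is an assumption made for illustration and is not taken from the paper: $Y$ is the outcome, $A$ the treatment, $X$ the effect modifiers, $L$ the confounders, $\omega(\cdot)$ the unspecified nuisance function, and $h(A, X)$ a user-chosen design vector for the linear part.

```latex
% Illustrative notation only; symbols are assumptions, not the authors' own.
\begin{align}
  E[Y \mid A, X, L] &= \omega(L) + h(A, X)^{\top}\psi, \\
  \omega(L) &\approx \sum_{j=1}^{m} g_j(L)
    \quad \text{(a BART sum of $m$ regression trees)}.
\end{align}
```

Under this reading, posterior inference targets the finite-dimensional vector $\psi$, while $\omega$ absorbs the confounders flexibly. For example, $h(A, X) = (A,\; A X)^{\top}$ would give a treatment main effect plus a linear effect modification by $X$.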

General

Deep Learning for Acute Myeloid Leukemia Diagnosis.

In Journal of medicine and life

As lifestyles change and cancer incidence rises, accurate diagnosis becomes increasingly important. Today, DNA microarrays are widely used in cancer diagnosis and screening because they can measure gene expression levels. Common statistical methods are poorly suited to analyzing these data because of the high dimensionality of gene expression data. This study therefore aims to use newer techniques to diagnose acute myeloid leukemia. The leukemia microarray gene data, containing 22283 genes, were extracted from the Gene Expression Omnibus repository. Initial preprocessing was applied using a normalization test and principal component analysis (PCA) in Python. A deep neural network (DNN) was then designed and applied to the data, and the results were finally cross-validated by classifiers. The normalization test was significant (P>0.05), and the results show the potential of PCA for gene segregation and the independence of cancer and healthy cells. The accuracies of the single-layer neural network and the DNN with three hidden layers were 63.33 and 96.67, respectively. Newer methods such as deep learning can improve diagnostic accuracy and performance compared with older methods. These methods are recommended for cancer diagnosis and effective gene selection in various types of cancer.

Nazari Elham, Farzin Amir Hossein, Aghemiri Mehran, Avan Amir, Tara Mahmood, Tabesh Hamed

AML, deep learning, machine learning, microarray, neural network
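
The abstract above mentions PCA as a preprocessing step in Python but gives no implementation details. The following is a minimal, standard-library-only sketch of how the first principal component of a samples-by-genes matrix could be found by power iteration. The function name `first_principal_component` and its parameters are hypothetical; the authors most likely used a numerical library, and this is not their code.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (unit direction, explained-variance ratio) of the top component.

    rows: list of samples, each a list of per-gene expression values.
    """
    n, d = len(rows), len(rows[0])
    # Center each column (gene) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Start from a random unit vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # w = X^T (X v): a covariance-vector product that avoids
        # forming the d x d covariance matrix explicitly.
        scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # constant data: no direction of variance
        v = [x / norm for x in w]

    # Share of total variance captured along the final direction.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    top_var = sum(s * s for s in scores)
    total_var = sum(x * x for r in centered for x in r)
    ratio = top_var / total_var if total_var > 0 else 0.0
    return v, ratio
```

Projecting each centered sample onto the returned direction gives a single score per sample, which is the kind of low-dimensional input a downstream classifier could use in place of all 22283 gene values.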

General

Designing low-cost, accurate cervical screening strategies that take into account COVID-19: a role for self-sampled HPV typing.

In Infectious agents and cancer

Background: We propose an economical cervical screening research and implementation strategy designed to take into account the typically slow natural history of cervical cancer and the severe but hopefully temporary impact of COVID-19. The commentary introduces the practical validation of some critical components of the strategy, described in three manuscripts detailing recent project results in Asia and Africa. The main phases of a cervical screening program are 1) primary screening of women in the general population, 2) triage testing of the small minority of women that screen positive to determine need for treatment, and 3) treatment of triage-positive women thought to be at highest risk of precancer or even cancer. In each phase, attention must now be paid to safety in relation to SARS-CoV-2 transmission. The new imperatives of the COVID-19 pandemic support self-sampled HPV testing as the primary cervical screening method. Most women can be reassured for several years by a negative test performed on a self-sample collected at home, without need of clinic visit and speculum examination. The advent of relatively inexpensive, rapid and accurate HPV DNA testing makes it possible to return screening results from self-sampling very soon after specimen collection, minimizing loss to follow-up. Partial HPV typing provides important risk stratification useful for triage of HPV-positive women. A second "triage" test is often useful to guide management. In lower-resource settings, visual inspection with acetic acid (VIA) is still proposed but it is inaccurate and poorly reproducible, misclassifying the risk stratification gained by primary HPV testing. A deep-learning based approach to recognizing cervical precancer, adaptable to a smartphone camera, is being validated to improve VIA performance. The advent and approval of thermal ablation permits quick, affordable and safe, immediate treatment at the triage clinic of the majority of HPV-positive, triage-positive women.

Conclusions: Overall, only a small percentage of women in cervical screening programs need to attend the hospital clinic for a surgical procedure, particularly when screening is targeted to the optimal age range for detection of precancer rather than older ages with decreased visual screening performance and higher risks of hard-to-treat outcomes including invasive cancer.

Ajenifuja Kayode Olusegun, Belinson Jerome, Goldstein Andrew, Desai Kanan T, de Sanjose Silvia, Schiffman Mark


COVID-19, Cervical screening, HPV, Self-sampling, Triage

General

Automated discretization of 'transpiration restriction to increasing VPD' features from outdoors high-throughput phenotyping data.

In Plant methods

Background: Restricting transpiration under high vapor pressure deficit (VPD) is a promising water-saving trait for drought adaptation. However, it is often measured under controlled conditions and at very low throughput, unsuitable for breeding. A few high-throughput phenotyping (HTP) studies exist, and have considered only maximum transpiration rate in analyzing genotypic differences in this trait. Further, no study has precisely identified the VPD breakpoints where genotypes restrict transpiration under natural conditions. Therefore, outdoors HTP data (15 min frequency) of a chickpea population were used to automate the generation of smooth transpiration profiles, extract informative features of the transpiration response to VPD for optimal genotypic discretization, identify VPD breakpoints, and compare genotypes.

Results: Fifteen biologically relevant features were extracted from the transpiration rate profiles derived from load cells data. Genotypes were clustered (C1, C2, C3) and the 6 most important features (with heritability > 0.5) were selected using unsupervised Random Forest. All the wild relatives were found in C1, while C2 and C3 mostly comprised high TE and low TE lines, respectively. Assessment of the distinct p-value groups within each selected feature revealed the highest genotypic variation for the feature representing transpiration response to high VPD condition. Sensitivity analysis on a multi-output neural network model (with R of 0.931, 0.944, 0.953 for C1, C2, C3, respectively) found that C1 had the highest water-saving ability and restricted transpiration at relatively low VPD levels, 56% (i.e. 3.52 kPa) or 62% (i.e. 3.90 kPa), depending on whether the influence of other environmental variables was minimum or maximum. Also, VPD appeared to have the most striking influence on the transpiration response independently of other environmental variables, whereas light, temperature, and relative humidity alone had little/no effect.

Conclusion: Through this study, we present a novel approach to identifying genotypes with drought-tolerance potential, which overcomes the challenges in HTP of the water-saving trait. The six selected features served as proxy phenotypes for reliable genotypic discretization. The wild chickpeas were found to limit water-loss faster than the water-profligate cultivated ones. Such an analytic approach can be directly used for prescriptive breeding applications, applied to other traits, and help expedite maximized information extraction from HTP data.

Kar Soumyashree, Tanaka Ryokei, Korbu Lijalem Balcha, Kholová Jana, Iwata Hiroyoshi, Durbha Surya S, Adinarayana J, Vadez Vincent


Feature selection, Gini index, High throughput phenotyping, Machine learning, Neural network, Sensitivity analysis, Time series, Transpiration rate, Unsupervised random-forest, Vapor pressure deficit
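
The abstract above locates VPD breakpoints through sensitivity analysis on a neural network. As a much simpler stand-in technique, and not the authors' method, a breakpoint in a transpiration-versus-VPD curve can be estimated by a two-segment linear fit. The sketch below assumes hypothetical function names and takes plain lists of VPD and transpiration readings.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def find_breakpoint(vpd, transpiration, min_points=3):
    """Return the largest VPD value of the lower segment in the best split.

    Every split into two contiguous segments (each with at least
    min_points observations) is tried; the split with the smallest
    combined squared error of the two straight-line fits wins.
    """
    pairs = sorted(zip(vpd, transpiration))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best_sse, best_x = float("inf"), None
    for k in range(min_points, len(xs) - min_points + 1):
        sse = fit_line(xs[:k], ys[:k])[2] + fit_line(xs[k:], ys[k:])[2]
        if sse < best_sse:
            best_sse, best_x = sse, xs[k - 1]
    return best_x


# Synthetic example (made-up values, not data from the study):
# transpiration rises with VPD and then plateaus.
vpd = [0.25 + 0.5 * i for i in range(12)]
rate = [x if x < 3.0 else 3.0 for x in vpd]
breakpoint_vpd = find_breakpoint(vpd, rate)
```

A genotype that restricts transpiration early would show a lower estimated breakpoint than a water-profligate one, which is the comparison the abstract draws between wild and cultivated chickpeas.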

General

T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm.

In Frontiers in microbiology

Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have some obvious limitations such as the lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors using optimal features derived from protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost, as assessed by 5-fold cross-validation in comparison with other machine learning methods. Then, the ReliefF algorithm was adopted to get the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to improved understanding of multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies to construct prediction methods on related biological problems. The data and source code of this study can be freely accessed at

Chen Tianhang, Wang Xiangeng, Chu Yanyi, Wang Yanjing, Jiang Mingming, Wei Dong-Qing, Xiong Yi


SHAP (SHapley additive exPlanations), extreme gradient boosting, feature selection, interpretable analysis, type IV secreted effector
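
The T4SE-XGB abstract above relies on features computed from protein sequences but does not list them here. Amino acid composition is one common sequence-derived descriptor; whether it is among the 20 feature types the authors tried is an assumption. The sketch below, with a hypothetical function name, shows how a sequence could be turned into a fixed-length vector that a gradient boosting model could consume.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in a protein sequence.

    Non-standard symbols (e.g. X, gaps) are ignored, so the 20 values
    sum to 1 over the standard residues that are present.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            valid += 1
    if valid == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / valid for aa in AMINO_ACIDS]


# Toy sequence for illustration only; not a real effector.
features = amino_acid_composition("AACD")
```

Each protein yields a 20-element vector regardless of its length, which is what lets sequences of different sizes share one feature matrix before feature selection and model fitting.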

Cardiology

Usefulness of machine learning in COVID-19 for the detection and prognosis of cardiovascular complications.

In Reviews in cardiovascular medicine

Since January 2020, coronavirus disease 2019 (COVID-19) has rapidly become a global concern, and its cardiovascular manifestations have highlighted the need for fast, sensitive and specific tools for early identification and risk stratification. Machine learning is a software solution with the ability to analyze large amounts of data and make predictions without prior programming. When faced with new problems with unique challenges as evident in the COVID-19 pandemic, machine learning can offer solutions that are not apparent on the surface by sifting quickly through massive quantities of data and making associations that may have been missed. Artificial intelligence is a broad term that encompasses different tools, including various types of machine learning and deep learning. Here, we review several cardiovascular applications of machine learning and artificial intelligence and their potential applications to cardiovascular diagnosis, prognosis, and therapy in COVID-19 infection.

Zimmerman Allison, Kalra Dinesh


COVID-19, artificial intelligence, cardiovascular, machine learning