
General

Predictive models for diabetes mellitus using machine learning techniques.

In BMC endocrine disorders

BACKGROUND : Diabetes Mellitus is an increasingly prevalent chronic disease characterized by the body's inability to metabolize glucose. The objective of this study was to build an effective predictive model with high sensitivity and selectivity to better identify Canadian patients at risk of having Diabetes Mellitus based on patient demographic data and the laboratory results during their visits to medical facilities.

METHODS : Using the most recent records of 13,309 Canadian patients aged between 18 and 90 years, along with their laboratory information (age, sex, fasting blood glucose, body mass index, high-density lipoprotein, triglycerides, blood pressure, and low-density lipoprotein), we built predictive models using Logistic Regression and Gradient Boosting Machine (GBM) techniques. The area under the receiver operating characteristic curve (AROC) was used to evaluate the discriminatory capability of these models. We used the adjusted threshold method and the class weight method to improve sensitivity, that is, the proportion of Diabetes Mellitus patients correctly predicted by the model. We also compared these models to other machine learning techniques such as Decision Tree and Random Forest.

RESULTS : The AROC for the proposed GBM model is 84.7% with a sensitivity of 71.6% and the AROC for the proposed Logistic Regression model is 84.0% with a sensitivity of 73.4%. The GBM and Logistic Regression models perform better than the Random Forest and Decision Tree models.

CONCLUSIONS : Our model predicts patients with Diabetes well from a few commonly used lab results, with satisfactory sensitivity. These models can be built into an online computer program to help physicians predict which patients will develop diabetes in the future and provide the necessary preventive interventions. The model was developed and validated on the Canadian population, which makes it more specific and powerful for Canadian patients than existing models developed from US or other populations. Fasting blood glucose, body mass index, high-density lipoprotein, and triglycerides were the most important predictors in these models.

Lai Hang, Huang Huaxiong, Keshavjee Karim, Guergachi Aziz, Gao Xin


Diabetes mellitus, Gradient boosting machine, Machine learning, Misclassification cost, Predictive models
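
The adjusted threshold method and the AROC metric described in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function names, labels, and scores below are hypothetical, and the class weight method is not shown. The sketch only demonstrates that lowering the decision threshold raises sensitivity at the cost of specificity, while AROC, which depends only on the ranking of scores, stays the same.

```python
def sensitivity(y_true, scores, threshold):
    """Share of true positive cases whose score reaches the threshold."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def specificity(y_true, scores, threshold):
    """Share of true negative cases whose score stays below the threshold."""
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


def aroc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case outranks a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = diabetes) and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.3, 0.5, 0.35, 0.2, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    print(threshold,
          sensitivity(y_true, scores, threshold),
          specificity(y_true, scores, threshold))
print("AROC:", aroc(y_true, scores))
```

On this toy data, moving the threshold from 0.5 to 0.3 raises sensitivity and lowers specificity, which is the trade-off the adjusted threshold method exploits.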

General

Assessing the multi-pathway threat from an invasive agricultural pest: Tuta absoluta in Asia.

In Proceedings. Biological sciences

Modern food systems facilitate rapid dispersal of pests and pathogens through multiple pathways. The complexity of spread dynamics and data inadequacy make it challenging to model the phenomenon and also to prepare for emerging invasions. We present a generic framework to study the spatio-temporal spread of invasive species as a multi-scale propagation process over a time-varying network accounting for climate, biology, seasonal production, trade and demographic information. Machine learning techniques are used in a novel manner to capture model variability and analyse parameter sensitivity. We applied the framework to understand the spread of a devastating pest of tomato, Tuta absoluta, in South and Southeast Asia, a region at the frontier of its current range. Analysis with respect to historical invasion records suggests that even with modest self-mediated spread capabilities, the pest can quickly expand its range through domestic city-to-city vegetable trade. Our models forecast that within 5-7 years, Tuta absoluta will invade all major vegetable growing areas of mainland Southeast Asia assuming unmitigated spread. Monitoring high-consumption areas can help in early detection, and targeted interventions at major production areas can effectively reduce the rate of spread.

McNitt Joseph, Chungbaek Young Yun, Mortveit Henning, Marathe Madhav, Campos Mateus R, Desneux Nicolas, Brévault Thierry, Muniappan Rangaswamy, Adiga Abhijin


agent-based modelling, biological invasion, epidemic network models, human-mediated spread, insect pests, spread model

General

Ultrasonic Diagnosis of Breast Nodules Using Modified Faster R-CNN.

In Ultrasonic imaging

Breast cancer has become the biggest threat to female health. Ultrasonic diagnosis of breast cancer based on artificial intelligence is basically a classification of benign and malignant tumors, which does not meet clinical demand. Besides, the current target detection method performs poorly in detecting small lesions, while it is clinically required to detect nodules below 2 mm. The objective of this study is to (a) propose a diagnostic method based on the Breast Imaging Reporting and Data System (BI-RADS) and (b) increase its detectability of small lesions. We modified the framework of Faster R-CNN (Faster Region-based Convolutional Neural Network) by introducing multi-scale feature extraction and multi-resolution candidate bound extraction into the network. Then, it was trained using 852 images of BI-RADS C2, 739 images of C3, and 1662 images of malignancy (BI-RADS 4a/4b/4c/5/6). We compared our model with unmodified Faster R-CNN and YOLO v3 (You Only Look Once v3). The mean average precision (mAP) increased significantly to 0.913, while the average detection speed declined slightly to 4.11 FPS (frames per second). Meanwhile, its detectability of small lesions was effectively improved. Moreover, we also tentatively applied our model to video sequences and obtained satisfactory results. We modified Faster R-CNN and trained it partly based on BI-RADS. Its detectability of lesions, including small nodules, was significantly improved. In view of the wide coverage of the dataset and the satisfactory test results, our method can basically meet clinical needs.

Zhang Zihao, Zhang Xuesheng, Lin Xiaona, Dong Licong, Zhang Sure, Zhang Xueling, Sun Desheng, Yuan Kehong


ABUS (Automated Breast Ultrasound), Faster R-CNN, artificial intelligence, breast cancer, nodule detection

General

Unsupervised and Supervised Learning over the Energy Landscape for Protein Decoy Selection.

In Biomolecules

The energy landscape that organizes microstates of a molecular system and governs the underlying molecular dynamics exposes the relationship between molecular form/structure, changes to form, and biological activity or function in the cell. However, several challenges stand in the way of leveraging energy landscapes for relating structure and structural dynamics to function. Energy landscapes are high-dimensional, multi-modal, and often overly-rugged. Deep wells or basins in them do not always correspond to stable structural states but are instead the result of inherent inaccuracies in semi-empirical molecular energy functions. Due to these challenges, energetics is typically ignored in computational approaches addressing long-standing central questions in computational biology, such as protein decoy selection. In the latter, the goal is to determine over a possibly large number of computationally-generated three-dimensional structures of a protein those structures that are biologically-active/native. In recent work, we have recast our attention on the protein energy landscape and its role in helping us to advance decoy selection. Here, we summarize some of our successes so far in this direction via unsupervised learning. More importantly, we further advance the argument that the energy landscape holds valuable information to aid and advance the state of protein decoy selection via novel machine learning methodologies that leverage supervised learning. Our focus in this article is on decoy selection for the purpose of a rigorous, quantitative evaluation of how leveraging protein energy landscapes advances an important problem in protein modeling. However, the ideas and concepts presented here are generally useful to make discoveries in studies aiming to relate molecular structure and structural dynamics to function.

Akhter Nasrin, Chennupati Gopinath, Kabir Kazi Lutful, Djidjev Hristo, Shehu Amarda


basin, decoy selection, energy landscape, machine learning, model quality assessment, purity

General

D-GPM: A Deep Learning Method for Gene Promoter Methylation Inference.

In Genes

Whole-genome bisulfite sequencing generates a comprehensive profile of gene methylation levels, but is limited by its high cost. Recent studies have partitioned the genes into landmark genes and target genes and suggested that the landmark gene expression levels capture adequate information to reconstruct the target gene expression levels. This inspired us to propose that the methylation level of the promoters in landmark genes might be adequate to reconstruct the promoter methylation level of target genes, which would eventually reduce the cost of promoter methylation profiling. Here, we propose a deep learning model called Deep-Gene Promoter Methylation (D-GPM) to predict the whole-genome promoter methylation level based on the promoter methylation profile of the landmark genes from The Cancer Genome Atlas (TCGA). D-GPM-15%-7000 × 5, the optimal architecture of D-GPM, achieves the lowest overall mean absolute error (MAE) and the highest overall Pearson correlation coefficient (PCC), with values of 0.0329 and 0.8186, respectively, on the testing data. Additionally, the D-GPM outperforms the regression tree (RT), linear regression (LR), and the support vector machine (SVM) in 95.66%, 92.65%, and 85.49% of the target genes by virtue of its relatively lower MAE and in 98.25%, 91.00%, and 81.56% of the target genes based on its relatively higher PCC, respectively. More importantly, the D-GPM predominates in predicting 79.86% and 78.34% of the target genes according to the model distribution of the least MAE and the highest PCC, respectively.

Pan Xingxin, Liu Biao, Wen Xingzhao, Liu Yulu, Zhang Xiuqing, Li Shengbin, Li Shuaicheng


deep neural network, landmark genes, machine learning, promoter methylation, target genes
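
The two evaluation metrics named in the abstract above, mean absolute error (MAE) and the Pearson correlation coefficient (PCC), can be computed as in the following minimal sketch. It is an illustration only, not the D-GPM evaluation code: the function names and the methylation values are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical promoter methylation levels (0 to 1) for one target gene
# across four samples, for illustration only.
measured = [0.10, 0.40, 0.80, 0.30]
predicted = [0.12, 0.35, 0.85, 0.30]

print("MAE:", mean_absolute_error(measured, predicted))
print("PCC:", pearson_correlation(measured, predicted))
```

A lower MAE and a higher PCC both indicate closer agreement between predicted and measured values, which is why the abstract reports them together.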

Internal Medicine

Artificial Intelligence Algorithms and Natural Language Processing for the Recognition of Syncope Patients on Emergency Department Medical Records.

In Journal of clinical medicine

BACKGROUND : Enrollment of large cohorts of syncope patients from administrative data is crucial for proper risk stratification but is limited by the enormous amount of time required for manual revision of medical records.

AIM : To develop a Natural Language Processing (NLP) algorithm to automatically identify syncope from Emergency Department (ED) electronic medical records (EMRs).

METHODS : De-identified EMRs of all consecutive patients evaluated at Humanitas Research Hospital ED from 1 December 2013 to 31 March 2014 and from 1 December 2015 to 31 March 2016 were manually annotated to identify syncope. Records were combined in a single dataset and classified. The performance of combined multiple NLP feature selectors and classifiers was tested. Primary Outcomes: NLP algorithms' accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F3 score.

RESULTS : 15,098 and 15,222 records from the 2013 and 2015 datasets, respectively, were analyzed. Syncope was present in 571 records. The Normalized Gini Index feature selector combined with a Support Vector Machines classifier obtained the best F3 value (84.0%), with 92.2% sensitivity and 47.4% positive predictive value. A 96% reduction in analysis time was computed, compared with manual review of EMRs.

CONCLUSIONS : This artificial intelligence algorithm enabled the automatic identification of a large population of syncope patients using EMRs.

Dipaola Franca, Gatti Mauro, Pacetti Veronica, Bottaccioli Anna Giulia, Shiffer Dana, Minonzio Maura, Menè Roberto, Giaj Levra Alessandro, Solbiati Monica, Costantino Giorgio, Anastasio Marco, Sini Elena, Barbic Franca, Brunetta Enrico, Furlan Raffaello


Emergency Department, artificial intelligence, electronic medical records, natural language processing, syncope
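
The F3 score listed among the primary outcomes above is the F-beta score with beta = 3, which weights sensitivity (recall) more heavily than positive predictive value (precision). The sketch below uses the standard F-beta formula; the function name is hypothetical, and plugging in the rounded sensitivity and positive predictive value from the abstract only approximates the reported 84.0%, since the exact computation used in the study is not stated here.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Rounded values taken from the abstract: 47.4% PPV and 92.2% sensitivity.
ppv = 0.474
sens = 0.922

f3 = f_beta(ppv, sens, beta=3)  # favours sensitivity
f1 = f_beta(ppv, sens, beta=1)  # balances precision and sensitivity
print(f"F3 = {f3:.3f}, F1 = {f1:.3f}")
```

With these inputs F3 comes out near 0.84, while F1 is much lower, which shows how a beta of 3 rewards a classifier that misses few syncope cases even when many flagged records are false positives.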