
General General

Model-Based Autoencoders for Imputing Discrete single-cell RNA-seq Data.

In Methods (San Diego, Calif.)

Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data are common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, much recent genomic data consists of discrete counts generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevalent 'false' zero count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we propose to explicitly model the dropout events that cause missing values by using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulation datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real data experiments show that the proposed imputation can improve the separation of different cell types and the accuracy of differential expression analysis.

Tian Tian, Min Martin Renqiang, Wei Zhi


Deep learning, Imputation, ScRNA-seq
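
To make the "ZINB likelihood" in the abstract above concrete, here is a minimal sketch of a zero-inflated negative binomial log-probability in plain Python. The parameterization (mean `mu`, dispersion `theta`, dropout probability `pi`) is a common convention and is an assumption here; the abstract does not state the authors' exact formulation, and the function names are hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    With probability pi (0 < pi < 1) the observation is a dropout and is
    recorded as 0; otherwise it is drawn from the negative binomial.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from a dropout or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_negative_log_likelihood(counts, mu, theta, pi):
    """Sum of negative log-probabilities over a list of observed counts."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

In an autoencoder setting, `mu`, `theta`, and `pi` would typically be produced per gene and per cell by the decoder, and a loss of this form would be minimized by gradient descent; that wiring is not shown here.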

General General

Fast and Flexible Protein Design Using Deep Graph Neural Networks.

In Cell systems

Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences, and we benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available. A record of this paper's transparent peer review process is included in the Supplemental Information.

Strokach Alexey, Becerra David, Corbi-Verge Carles, Perez-Riba Albert, Kim Philip M


constraint satisfaction problem, deep learning, graph neural networks, inverse protein folding, protein design, protein optimization

General General

Artificial Neural Networks for Neuroscientists: A Primer.

In Neuron ; h5-index 148.0

Artificial neural networks (ANNs) are essential tools in machine learning that have drawn increasing attention in neuroscience. Besides offering powerful techniques for data analysis, ANNs provide a new approach for neuroscientists to build models for complex behaviors, heterogeneous neural activity, and circuit connectivity, as well as to explore optimization in neural systems, in ways that traditional models are not designed for. In this pedagogical Primer, we introduce ANNs and demonstrate how they have been fruitfully deployed to study neuroscientific questions. We first discuss basic concepts and methods of ANNs. Then, with a focus on bringing this mathematical framework closer to neurobiology, we detail how to customize the analysis, structure, and learning of ANNs to better address a wide range of challenges in brain research. To help readers gain hands-on experience, this Primer is accompanied by tutorial-style code in PyTorch and Jupyter Notebook, covering major topics.

Yang Guangyu Robert, Wang Xiao-Jing


General General

Existence and possible roles of independent non-CpG methylation in the mammalian brain.

In DNA research : an international journal for rapid publication of reports on genes and genomes

Methylated non-CpGs (mCpHs) in mammalian cells yield weak enrichment signals and colocalize with methylated CpGs (mCpGs), and have therefore been considered byproducts of hyperactive methyltransferases. However, mCpHs are cell type-specific and associated with epigenetic regulation, although their dependency on mCpGs remains to be elucidated. In this study, we demonstrated that mCpHs colocalize with mCpGs in pluripotent stem cells, but not in brain cells. In addition, profiling genome-wide methylation patterns using a hidden Markov model revealed abundant genomic regions in which CpGs and CpHs are differentially methylated in the brain. These regions were frequently located in putative enhancers, and mCpHs within the enhancers increased in correlation with brain age. The enhancers with hypermethylated CpHs were associated with genes functionally enriched in immune responses, and some of the genes were related to neuroinflammation and degeneration. This study provides insight into the roles of non-CpG methylation as an epigenetic code in the mammalian brain genome.

Lee Jong-Hun, Saito Yutaka, Park Sung-Joon, Nakai Kenta


Hidden Markov model, Neuro-epigenetics, Non-CpG methylation

Radiology Radiology

A Prognostic Predictive System Based on Deep Learning for Locoregionally Advanced Nasopharyngeal Carcinoma.

In Journal of the National Cancer Institute

BACKGROUND : Magnetic resonance imaging (MRI) images are crucial unstructured data for prognostic evaluation in nasopharyngeal carcinoma (NPC). We developed and validated a prognostic system based on the MRI features and clinical data of locoregionally advanced NPC (LA-NPC) patients to distinguish low-risk patients with LA-NPC, for whom concurrent chemoradiotherapy (CCRT) is sufficient.

METHODS : This multicenter, retrospective study included 3444 patients with LA-NPC from January 1, 2010, to January 31, 2017. A three-dimensional convolutional neural network was used to learn the image features from pretreatment MRI images. An eXtreme Gradient Boosting model was trained with the MRI features and clinical data to assign an overall score to each patient. Comprehensive evaluations were implemented to assess the performance of the predictive system. We applied the overall score to distinguish high-risk patients from low-risk patients. The clinical benefit of induction chemotherapy (IC) was analyzed in each risk group by survival curves.

RESULTS : We constructed a prognostic system displaying a concordance index of 0.776 (95% CI = 0.746-0.806) for the internal validation cohort and 0.757 (95% CI = 0.695-0.819), 0.719 (95% CI = 0.650-0.789) and 0.746 (95% CI = 0.699-0.793) for the three external validation cohorts, which presented a statistically significant improvement compared to the conventional tumor-node-metastasis (TNM) staging system. In the high-risk group, patients who received IC plus CCRT had better outcomes than patients who received CCRT alone, while there was no statistically significant difference in the low-risk group.

CONCLUSIONS : The proposed framework can capture more complex and heterogeneous information to predict the prognosis of patients with LA-NPC and potentially contribute to clinical decision making.

Qiang Mengyun, Li Chaofeng, Sun Yuyao, Sun Ying, Ke Liangru, Xie Chuanmiao, Zhang Tao, Zou Yujian, Qiu Wenze, Gao Mingyong, Li Yingxue, Li Xiang, Zhan Zejiang, Liu Kuiyuan, Chen Xi, Liang Chixiong, Chen Qiuyan, Mai Haiqiang, Xie Guotong, Guo Xiang, Lv Xing
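
The concordance index reported in the RESULTS above measures how often a model ranks patients in the same order as their observed outcomes. A minimal sketch of the pairwise (Harrell-style) definition follows; the abstract does not say which estimator the authors used, so this is an illustrative assumption, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, scores):
    """Pairwise concordance index for right-censored survival data.

    times:  observed follow-up time for each patient
    events: 1 if the event was observed, 0 if the patient was censored
    scores: predicted risk; a higher score should mean an earlier event

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. Tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: risk scores perfectly ordered against survival time.
toy_c = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4.0, 3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported values of roughly 0.72 to 0.78.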


General General

Association of violence with urban points of interest.

In PloS one ; h5-index 176.0

The association between alcohol outlets and violence has long been recognised, and is commonly used to inform policing and licensing policies (such as staggered closing times and zoning). Less investigated, however, is the association between violent crime and other urban points of interest, which, while associated with the city centre alcohol consumption economy, are not explicitly alcohol outlets. Here, machine learning (specifically, LASSO regression) is used to model the distribution of violent crime for the central 9 km² of ten large UK cities. Densities of 620 different Point of Interest (POI) types (sourced from Ordnance Survey) are used as predictors, with the 10 most explanatory variables being automatically selected for each city. Cross validation is used to test the generalisability of each model. Results show that the inclusion of additional point of interest types produces a more accurate model, with significant increases in performance over a baseline univariate alcohol-outlet-only model. Analysis of chosen variables for city-specific models shows potential candidates for new strategies on a per-city basis, with combined-model variables showing the general trend in POI/violence association across the UK. Although alcohol outlets remain the best individual predictor of violence, other points of interest should also be considered when modelling the distribution of violence in city centres. The presented method could be used to develop targeted, city-specific initiatives that go beyond alcohol outlets and also consider other locations.

Redfern Joseph, Sidorov Kirill, Rosin Paul L, Corcoran Padraig, Moore Simon C, Marshall David
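
The abstract above relies on LASSO regression to select a small set of explanatory variables automatically. A minimal sketch of how the L1 penalty produces that sparsity is given below, using coordinate descent in plain Python. It assumes centred predictors and response (no intercept), and it is a generic illustration, not the authors' pipeline; the function names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1.

    X is a list of rows, y a list of targets; both are assumed centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # a constant-zero column carries no information
            # Correlation of column j with the residual, excluding its own fit.
            rho = sum(col[i] * (residual[i] + w[j] * col[i]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * col[i]
                w[j] = new_w
    return w


def top_k_features(w, k):
    """Indices of the k largest non-zero coefficients by absolute value."""
    nonzero = [j for j, coef in enumerate(w) if coef != 0.0]
    return sorted(nonzero, key=lambda j: abs(w[j]), reverse=True)[:k]


# Toy data: the response depends strongly on feature 0 and weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = top_k_features(weights, k=10)
```

In this toy case the penalty zeroes out the weak feature and keeps only feature 0, which is the mechanism behind selecting a handful of point-of-interest types from hundreds of candidates. In practice `alpha` would be chosen by cross validation.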