General

A Siamese neural network model for the prioritization of metabolic disorders by integrating real and simulated data.

In Bioinformatics (Oxford, England)

MOTIVATION : Untargeted metabolomic approaches hold great promise as a diagnostic tool for inborn errors of metabolism (IEMs) in the near future. However, the complexity of the data involved makes their application difficult and time-consuming. Computational approaches, such as metabolic network simulations and machine learning, could significantly help to exploit metabolomic data in the diagnostic process. While the former suffers from limited predictive accuracy, the latter normally generalizes only to IEMs for which sufficient data are available. Here, we propose a hybrid approach that exploits the best of both worlds by building a mapping between simulated and real metabolic data through a novel method based on Siamese neural networks (SNNs).

RESULTS : The proposed SNN model performs disease prioritization on the metabolic profiles of IEM patients even for diseases it was not trained to identify. To the best of our knowledge, this has not been attempted before. The model significantly outperforms a baseline that relies on metabolic simulations only. The prioritization performance demonstrates the feasibility of the method, suggesting that the integration of metabolic models and data could significantly aid the IEM diagnosis process in the near future.
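
The core mechanism is a pair of weight-sharing encoders trained so that a real profile and a simulated profile of the same disease land close together in embedding space. Below is a minimal PyTorch sketch of that setup; the layer sizes, loss margin, and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, not the authors' implementation: a weight-sharing
# encoder pair with a contrastive loss. Sizes and margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, n_metabolites: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_metabolites, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, real_profile, simulated_profile):
        # The two branches share the same weights.
        return self.net(real_profile), self.net(simulated_profile)

def contrastive_loss(z_real, z_sim, same_disease, margin: float = 1.0):
    # same_disease: 1.0 where the pair describes the same IEM, else 0.0.
    dist = F.pairwise_distance(z_real, z_sim)
    # Matched pairs are pulled together, mismatched pushed past the margin.
    return torch.mean(
        same_disease * dist.pow(2)
        + (1.0 - same_disease) * F.relu(margin - dist).pow(2)
    )
```

At prioritization time, diseases would then be ranked by the embedding distance between a patient's real profile and each disease's simulated profiles, which is what allows scoring diseases absent from training.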

AVAILABILITY AND IMPLEMENTATION : Metabolic datasets used in this study are publicly available from the cited sources. The original data produced in this study, including the trained models and the simulated metabolic profiles, are also publicly available (Messa et al., 2020).

Messa Gian Marco, Napolitano Francesco, Elsea Sarah H, di Bernardo Diego, Gao Xin

2020-Dec-30

General

SCHNEL: scalable clustering of high dimensional single-cell data.

In Bioinformatics (Oxford, England)

MOTIVATION : Single-cell data measure multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step toward further biological understanding, and is usually performed by clustering these data. Clustering tools based on dimensionality reduction are either not scalable to large datasets containing millions of cells, or not fully automated, requiring an initial manual estimate of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data, but scale poorly to large datasets.

RESULTS : We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data into a hierarchy of datasets containing subsets of data points that follow the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. On seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and produced meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as the popular machine learning benchmark dataset MNIST.
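
The scalability argument is that the expensive graph clustering step only ever runs on a small hierarchy level, whose labels are then propagated back to all cells. The sketch below illustrates that pattern with a random landmark subset standing in for SCHNEL's manifold-preserving (HSNE) hierarchy; it is a simplified illustration, not the SCHNELpy implementation.

```python
# Simplified illustration of the SCHNEL pattern: graph-cluster a small
# landmark set, then let every cell inherit its nearest landmark's label.
# A random subsample stands in for the actual HSNE hierarchy.
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors, kneighbors_graph

def schnel_like_cluster(X: np.ndarray, n_landmarks: int = 2000, k: int = 15):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
    landmarks = X[idx]

    # Graph clustering touches only the landmarks, never the full dataset,
    # which is what keeps the method workable for millions of cells.
    adj = kneighbors_graph(landmarks, n_neighbors=k, mode="connectivity")
    graph = nx.from_scipy_sparse_array(adj)
    communities = nx.community.louvain_communities(graph, seed=0)
    landmark_labels = np.empty(len(landmarks), dtype=int)
    for label, members in enumerate(communities):
        landmark_labels[list(members)] = label

    # Propagate labels down: each cell takes its nearest landmark's cluster.
    nearest = NearestNeighbors(n_neighbors=1).fit(landmarks)
    _, nn_idx = nearest.kneighbors(X)
    return landmark_labels[nn_idx[:, 0]]
```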

AVAILABILITY AND IMPLEMENTATION : Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Abdelaal Tamim, de Raadt Paul, Lelieveldt Boudewijn P F, Reinders Marcel J T, Mahfouz Ahmed

2020-Dec-30

General

Graph convolutional networks for epigenetic state prediction using both sequence and 3D genome data.

In Bioinformatics (Oxford, England)

MOTIVATION : Predictive models of DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, strongly influences the chromatin profile. Deep neural networks have achieved state-of-the-art performance on chromatin profile prediction by using short windows of DNA sequences independently. These methods, however, ignore long-range dependencies when predicting chromatin profiles because modeling the 3D genome is challenging.

RESULTS : In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction that fuses local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent-and-identically-distributed assumption over local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the modeling, allowing us to identify and interpret the important long-range dependencies that influence chromatin profiles. We show experimentally that fusing sequential and 3D genome data with ChromeGCN yields a significant improvement over state-of-the-art deep learning methods on three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in DNA windows that have a high degree of interaction with other DNA windows.
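
The fusion step can be pictured as a graph convolution over windows: each window starts from its local CNN sequence embedding and mixes in the embeddings of the windows it contacts in 3D. A minimal PyTorch sketch of one such layer follows; the symmetric normalization and dimensions are standard GCN choices assumed here, not necessarily ChromeGCN's exact formulation.

```python
# Minimal sketch of one graph-convolution step over genomic windows:
# H holds per-window sequence embeddings (e.g. from a CNN), A is the
# Hi-C-derived contact adjacency. Standard GCN layer, assumed layout.
import torch
import torch.nn as nn

class WindowGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # H: (n_windows, dim) local embeddings; A: (n_windows, n_windows).
        A_hat = A + torch.eye(A.size(0), device=A.device)  # self-loops
        deg = A_hat.sum(dim=1).clamp(min=1e-8)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
        # Each window now mixes in the windows it contacts in 3D.
        return torch.relu(self.linear(A_norm @ H))
```

A per-window classifier head on the fused embeddings would then predict the chromatin profiles, with the graph step supplying the long-range context that window-independent models miss.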

AVAILABILITY AND IMPLEMENTATION : https://github.com/QData/ChromeGCN.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Lanchantin Jack, Qi Yanjun

2020-Dec-30

General

Geometricus represents protein structures as shape-mers derived from moment invariants.

In Bioinformatics (Oxford, England)

MOTIVATION : As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment-based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well.

RESULTS : We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in a variety of tasks, ranging from fast structure similarity search and unsupervised clustering to structure classification across proteins from different superfamilies as well as within the same family.
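
The shape-mer idea can be illustrated in a few lines: compute rotation- and translation-invariant descriptors for each structural fragment, discretize them into integer bins, and count the resulting identifiers. The sketch below uses covariance eigenvalues as a stand-in invariant set; Geometricus's actual moment invariants and discretization differ.

```python
# Illustration of the shape-mer idea with a stand-in invariant set:
# covariance eigenvalues of each fragment are rotation/translation
# invariant. Geometricus's actual moment invariants and binning differ.
import numpy as np
from collections import Counter

def fragment_invariants(coords: np.ndarray) -> np.ndarray:
    # coords: (k, 3) C-alpha positions of one k-residue fragment.
    centered = coords - coords.mean(axis=0)
    cov = centered.T @ centered / len(coords)
    # Eigenvalues are unchanged by rotating or translating the fragment.
    return np.linalg.eigvalsh(cov)

def shape_mer_counts(ca_coords: np.ndarray, k: int = 8, bin_width: float = 0.5):
    counts: Counter = Counter()
    for i in range(len(ca_coords) - k + 1):
        inv = fragment_invariants(ca_coords[i:i + k])
        # Coarse discretization turns continuous invariants into a
        # hashable "shape-mer" identifier, analogous to a sequence k-mer.
        shape_mer = tuple(np.floor(np.log1p(inv) / bin_width).astype(int))
        counts[shape_mer] += 1
    return counts  # sparse count vector describing the whole structure
```

Two structures can then be compared through any similarity measure on their count vectors, which is what makes alignment-free search and clustering fast.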

AVAILABILITY AND IMPLEMENTATION : Python code available at https://git.wur.nl/durai001/geometricus.

Durairaj Janani, Akdel Mehmet, de Ridder Dick, van Dijk Aalt D J

2020-Dec-30

Pathology

Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation.

In JAMIA open

Objective : Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low-data settings.

Materials and methods : Our data come from the Urologic Outcomes Database at UCSF, which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks, involving a wide range of pathologic features. To handle the diverse range of fields, we required two statistical models: a document classification method for pathologic features with a small set of possible values, and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct.
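
Isotonic calibration here means fitting a monotone map from the model's raw confidence to the empirical probability of being correct on held-out predictions. A minimal scikit-learn sketch of that step follows; the toy numbers are illustrative only.

```python
# Minimal sketch of isotonic calibration with scikit-learn: learn a
# monotone map from raw confidence to empirical correctness on held-out
# predictions. The toy numbers below are illustrative only.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out set: raw model confidences and whether each prediction was right.
raw_conf = np.array([0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99])
is_correct = np.array([0, 0, 1, 0, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_conf, is_correct)

# At inference time, report the calibrated probability instead of the
# raw score; this is what drives the expected calibration error down.
print(calibrator.predict(np.array([0.65, 0.85, 0.97])))
```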

Results : Our best document classification method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields, and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. Performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classification methods have reliable uncertainty estimates, our extraction-based methods do not; after isotonic calibration, however, expected calibration error drops below 0.03 for all extraction fields.

Conclusions : We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.

Odisho Anobel Y, Park Briton, Altieri Nicholas, DeNero John, Cooperberg Matthew R, Carroll Peter R, Yu Bin

2020-Oct

cancer, information extraction, machine learning, natural language processing, pathology, prostate cancer

General

Multi-source remote sensing image classification based on two-channel densely connected convolutional networks.

In Mathematical biosciences and engineering : MBE

Remote sensing image classification exploiting multiple sensors is a very challenging problem: traditional methods based on medium- or low-resolution remote sensing images provide low accuracy and a poor level of automation, because the potential of multi-source remote sensing data is not fully utilized and low-level features are not effectively organized. Recent deep learning methods can efficiently improve classification accuracy, but as the depth of the network increases, it becomes prone to overfitting. To address these problems, a novel Two-channel Densely Connected Convolutional Network (TDCC) is proposed to automatically classify ground surfaces based on deep learning and multi-source remote sensing data. The main contributions of this paper include the following: First, the multi-source remote sensing data, consisting of hyperspectral imagery (HSI) and Light Detection and Ranging (LiDAR) data, are pre-processed and re-sampled, and the hyperspectral and LiDAR data are fed into their respective feature extraction channels. Second, two-channel densely connected convolutional networks are proposed to automatically extract the spatial-spectral features of the HSI and LiDAR data. Third, a feature fusion network is designed to fuse the hyperspectral image features and LiDAR features; the fused features are classified, and the output is the category of the corresponding pixel. Experiments were conducted on a popular dataset, and the results demonstrate that TDCC achieves competitive classification performance compared with other state-of-the-art methods in terms of overall accuracy (OA), average accuracy (AA) and the Kappa coefficient, and that it is more suitable for the classification of complex ground surfaces.
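
The two-channel design can be summarized as two densely connected encoders, one per modality, whose feature maps are concatenated before classification. Below is a minimal PyTorch sketch of that structure; channel counts, depths, and fusion by simple concatenation are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the two-channel idea: separate densely connected
# encoders for HSI and LiDAR patches, fused by concatenation and
# classified jointly. Channel counts and depths are assumptions.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int = 16, layers: int = 2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
             for i in range(layers)]
        )

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            # Dense connectivity: each layer sees all previous feature maps,
            # which eases optimization as the network deepens.
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return torch.cat(feats, dim=1)

class TDCCLike(nn.Module):
    def __init__(self, hsi_bands: int, lidar_bands: int, n_classes: int):
        super().__init__()
        self.hsi_branch = DenseBlock(hsi_bands)
        self.lidar_branch = DenseBlock(lidar_bands)
        # Each DenseBlock adds layers * growth = 32 channels to its input.
        fused = (hsi_bands + 32) + (lidar_bands + 32)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(fused, n_classes)
        )

    def forward(self, hsi_patch, lidar_patch):
        # Fusion network reduced to concatenation of the two feature stacks.
        fused = torch.cat(
            [self.hsi_branch(hsi_patch), self.lidar_branch(lidar_patch)], dim=1
        )
        return self.classifier(fused)  # class scores for the center pixel
```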

Song Haifeng, Yang Weiwei, Dai Songsong, Yuan Haiyan

2020-Oct-27

LiDAR image, classification, DenseNet, hyperspectral image, multi-source