
General

RAINFOREST: a random forest approach to predict treatment benefit in data from (failed) clinical drug trials.

In Bioinformatics (Oxford, England)

MOTIVATION : When phase III clinical drug trials fail their endpoint, enormous resources are wasted. Moreover, even if a clinical trial demonstrates a significant benefit, the observed effects are often small and may not outweigh the side effects of the drug. There is therefore a great clinical need for methods that identify genetic markers defining subgroups of patients who are likely to benefit from treatment, as this may (i) rescue failed clinical trials and/or (ii) identify subgroups of patients who benefit more than the population as a whole. When single genetic biomarkers cannot be found, machine learning approaches that find multivariate signatures are required. For single nucleotide polymorphism (SNP) profiles, this is extremely challenging owing to the high dimensionality of the data. Here, we introduce RAINFOREST (tReAtment benefIt prediction using raNdom FOREST), which can predict treatment benefit from patient SNP profiles obtained in a clinical trial setting.

RESULTS : We demonstrate the performance of RAINFOREST on the CAIRO2 dataset, a phase III clinical trial that tested the addition of cetuximab treatment for metastatic colorectal cancer and concluded there was no benefit. However, we find that RAINFOREST is able to identify a subgroup, comprising 27.7% of the patients, that does benefit, with a hazard ratio of 0.69 (P = 0.04) in favor of cetuximab. The method is not specific to colorectal cancer; it could aid in the reanalysis of clinical trial data and provide a more personalized approach to cancer treatment, even when there is no clear link between a single variant and treatment benefit.
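
As a rough illustration of the general idea only (not the authors' RAINFOREST algorithm, which is available from the repository listed under Availability below), the hypothetical sketch that follows fits one random forest per treatment arm on synthetic SNP genotypes and scores each patient by the difference in predicted response probability; the data, outcome labels and the top-30% cutoff are invented for illustration.

```python
# Illustrative sketch only: not the authors' RAINFOREST algorithm.
# One random forest is fitted per treatment arm on synthetic SNP profiles,
# and each patient is scored by the difference in predicted response probability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_patients, n_snps = 400, 1000

X = rng.integers(0, 3, size=(n_patients, n_snps))            # SNP genotypes coded 0/1/2
treated = rng.integers(0, 2, size=n_patients).astype(bool)   # cetuximab arm yes/no
response = rng.integers(0, 2, size=n_patients)                # placeholder outcome labels

# One forest per arm (a simple "T-learner" setup; RAINFOREST itself differs).
rf_treated = RandomForestClassifier(n_estimators=500, random_state=0)
rf_control = RandomForestClassifier(n_estimators=500, random_state=0)
rf_treated.fit(X[treated], response[treated])
rf_control.fit(X[~treated], response[~treated])

# Predicted benefit = P(response | treated) - P(response | control).
benefit = rf_treated.predict_proba(X)[:, 1] - rf_control.predict_proba(X)[:, 1]
benefit_group = benefit > np.quantile(benefit, 0.7)           # e.g. flag the top 30%
print(f"Patients flagged as likely to benefit: {benefit_group.mean():.1%}")
```

In the trial setting described above, such a predicted-benefit subgroup would then be evaluated by comparing outcomes between treatment arms within the subgroup, in line with the hazard ratio reported in the results.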

AVAILABILITY AND IMPLEMENTATION : The R code used to produce the results in this paper can be found at github.com/jubels/RAINFOREST. A more configurable, user-friendly Python implementation of RAINFOREST is also provided. Due to restrictions based on privacy regulations and informed consent of participants, phenotype and genotype data of the CAIRO2 trial cannot be made freely available in a public repository. Data from this study can be obtained upon request. Requests should be directed toward Prof. Dr. H.J. Guchelaar (h.j.guchelaar@lumc.nl).

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Ubels Joske, Schaefers Tilman, Punt Cornelis, Guchelaar Henk-Jan, de Ridder Jeroen

2020-Dec-30

General

A Siamese neural network model for the prioritization of metabolic disorders by integrating real and simulated data.

In Bioinformatics (Oxford, England)

MOTIVATION : Untargeted metabolomic approaches hold great promise as a diagnostic tool for inborn errors of metabolism (IEMs) in the near future. However, the complexity of the involved data makes its application difficult and time-consuming. Computational approaches, such as metabolic network simulations and machine learning, could significantly help to exploit metabolomic data to aid the diagnostic process. While the former suffers from limited predictive accuracy, the latter is normally able to generalize only to IEMs for which sufficient data are available. Here, we propose a hybrid approach that exploits the best of both worlds by building a mapping between simulated and real metabolic data through a novel method based on Siamese neural networks (SNN).

RESULTS : The proposed SNN model is able to perform disease prioritization for the metabolic profiles of IEM patients, even for diseases it was not trained to identify; to the best of our knowledge, this has not been attempted before. The developed model significantly outperforms a baseline model that relies on metabolic simulations only. The prioritization performance demonstrates the feasibility of the method, suggesting that the integration of metabolic models and data could significantly aid the IEM diagnosis process in the near future.
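
A minimal Siamese-network sketch, assuming a simple fully connected encoder, a Euclidean contrastive loss and arbitrary dimensions (the published model, features and training setup differ): a single shared encoder embeds both simulated and real metabolic profiles, and matched disease pairs are pulled together while mismatched pairs are pushed apart.

```python
# Minimal Siamese-network sketch with an assumed architecture, not the published model.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, n_metabolites=200, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_metabolites, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x_sim, x_real):
        # The same weights encode both branches (the "Siamese" part).
        return self.net(x_sim), self.net(x_real)

def contrastive_loss(z1, z2, same_disease, margin=1.0):
    # Pull matched pairs together, push mismatched pairs apart up to a margin.
    d = torch.norm(z1 - z2, dim=1)
    return torch.mean(same_disease * d ** 2 +
                      (1 - same_disease) * torch.clamp(margin - d, min=0) ** 2)

model = SiameseEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: simulated profiles, real profiles, and a 0/1 label indicating
# whether the pair corresponds to the same IEM.
x_sim, x_real = torch.randn(64, 200), torch.randn(64, 200)
same = torch.randint(0, 2, (64,)).float()

opt.zero_grad()
z_sim, z_real = model(x_sim, x_real)
loss = contrastive_loss(z_sim, z_real, same)
loss.backward()
opt.step()
print(float(loss))
```

At prioritization time, a new patient profile could then be embedded once and ranked against the embedded simulated disease profiles by distance.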

AVAILABILITY AND IMPLEMENTATION : Metabolic datasets used in this study are publicly available from the cited sources. The original data produced in this study, including the trained models and the simulated metabolic profiles, are also publicly available (Messa et al., 2020).

Messa Gian Marco, Napolitano Francesco, Elsea Sarah H, di Bernardo Diego, Gao Xin

2020-Dec-30

General

SCHNEL: scalable clustering of high dimensional single-cell data.

In Bioinformatics (Oxford, England)

MOTIVATION : Single-cell data measure multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step toward further biological understanding, and is usually performed by clustering these data. Clustering tools based on dimensionality reduction are either not scalable to large datasets containing millions of cells, or not fully automated, requiring an initial manual estimate of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data, but scale poorly to large datasets.

RESULTS : We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data into a hierarchy of datasets containing subsets of the data points that follow the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. On seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data and produced meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data as well as to the popular machine learning benchmark dataset MNIST.
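
For orientation, the sketch below shows only the graph-clustering component (a kNN graph followed by Leiden community detection, here via scanpy on toy data); the hierarchical representation that makes this scale to millions of cells, which is SCHNEL's actual contribution, is not reproduced.

```python
# Sketch of graph clustering on single-cell data: kNN graph + Leiden via scanpy.
# This is not SCHNEL itself; the hierarchical step is omitted.
import numpy as np
import scanpy as sc

X = np.random.rand(5000, 30)              # toy data: 5000 cells x 30 markers
adata = sc.AnnData(X)

sc.pp.neighbors(adata, n_neighbors=15)    # build a kNN graph on the cells
sc.tl.leiden(adata, resolution=1.0)       # community detection on that graph

print(adata.obs["leiden"].value_counts())
```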

AVAILABILITY AND IMPLEMENTATION : Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Abdelaal Tamim, de Raadt Paul, Lelieveldt Boudewijn P F, Reinders Marcel J T, Mahfouz Ahmed

2020-Dec-30

General

Graph convolutional networks for epigenetic state prediction using both sequence and 3D genome data.

In Bioinformatics (Oxford, England)

MOTIVATION : Predictive models of DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, strongly influences the chromatin profile. Deep neural networks have achieved state-of-the-art performance on chromatin profile prediction by using short windows of DNA sequence independently. These methods, however, ignore long-range dependencies when predicting chromatin profiles because modeling the 3D genome is challenging.

RESULTS : In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction that fuses local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent and identically distributed assumption on local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the model, allowing us to identify and interpret the long-range dependencies that influence chromatin profiles. We show experimentally that fusing sequence and 3D genome data with ChromeGCN yields a significant improvement over state-of-the-art deep learning methods on three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in DNA windows that have a high degree of interaction with other windows.
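
As a conceptual sketch only (the shapes, the single graph-convolution layer and the residual fusion below are assumptions, not the published ChromeGCN architecture): per-window sequence embeddings can be smoothed over a Hi-C contact graph using a symmetrically normalized adjacency matrix before per-window chromatin-profile prediction.

```python
# Toy graph convolution over a Hi-C contact graph; not the published architecture.
import torch
import torch.nn as nn

n_windows, emb_dim, n_profiles = 1000, 128, 919

H = torch.randn(n_windows, emb_dim)       # per-window embeddings from a local sequence model
A = (torch.rand(n_windows, n_windows) > 0.99).float()        # toy Hi-C contact graph
A = torch.clamp(A + A.t() + torch.eye(n_windows), max=1.0)   # symmetrize, add self-loops

# Symmetric normalization: A_hat = D^{-1/2} A D^{-1/2}
deg = A.sum(dim=1)
D_inv_sqrt = torch.diag(deg.pow(-0.5))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

W = nn.Linear(emb_dim, emb_dim)           # graph-convolution weights
classifier = nn.Linear(emb_dim, n_profiles)

H_graph = torch.relu(A_hat @ W(H))        # one GCN layer over the contact graph
H_fused = H + H_graph                     # residual fusion of local and long-range signal
logits = classifier(H_fused)              # multi-label chromatin-profile logits
print(logits.shape)                       # torch.Size([1000, 919])
```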

AVAILABILITY AND IMPLEMENTATION : https://github.com/QData/ChromeGCN.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Lanchantin Jack, Qi Yanjun

2020-Dec-30

General

Geometricus represents protein structures as shape-mers derived from moment invariants.

In Bioinformatics (Oxford, England)

MOTIVATION : As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well.

RESULTS : We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search and unsupervised clustering to structure classification across proteins from different superfamilies as well as within the same family.
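
To give a flavor of the idea (the actual Geometricus moment invariants and discretization scheme differ), the sketch below computes a simple rotation- and translation-invariant descriptor for each overlapping C-alpha fragment, namely the eigenvalues of its covariance tensor, rounds it into a discrete "shape-mer" code and counts the codes.

```python
# Illustrative shape-mer counting with simple invariants; not the Geometricus invariants.
import numpy as np
from collections import Counter

def fragment_invariants(coords):
    """coords: (k, 3) array of C-alpha positions for one fragment."""
    centered = coords - coords.mean(axis=0)        # translation invariance
    cov = centered.T @ centered / len(coords)
    return np.linalg.eigvalsh(cov)                 # rotation-invariant, ascending eigenvalues

def shape_mer_counts(ca_coords, k=8, resolution=1.0):
    counts = Counter()
    for i in range(len(ca_coords) - k + 1):
        inv = fragment_invariants(ca_coords[i:i + k])
        code = tuple(np.round(inv / resolution).astype(int))   # coarse "shape-mer" code
        counts[code] += 1
    return counts                                  # the structure as a bag of shape-mers

ca = np.random.rand(100, 3) * 30                   # toy protein: 100 random C-alpha coordinates
print(list(shape_mer_counts(ca).items())[:5])
```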

AVAILABILITY AND IMPLEMENTATION : Python code available at https://git.wur.nl/durai001/geometricus.

Durairaj Janani, Akdel Mehmet, de Ridder Dick, van Dijk Aalt D J

2020-Dec-30

Pathology

Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation.

In JAMIA open

Objective : Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and to evaluate performance in low-data settings.

Materials and methods : Our data come from the Urologic Outcomes Database at UCSF, which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks involving a wide range of pathologic features. To handle this diverse range of fields, we required two statistical models: a document classification method for pathologic features with a small set of possible values, and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimate of its likelihood of being correct.

Results : Our best document classification method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields, and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. Performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classification methods have reliable uncertainty estimates, our extraction-based methods do not; after isotonic calibration, however, the expected calibration error drops below 0.03 for all extraction fields.
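
As a self-contained sketch of the calibration step only (the paper's models, pathology fields and data are not reproduced; the probabilities and labels below are synthetic), isotonic regression can be fitted on held-out probability/label pairs and assessed with a simple binned expected calibration error.

```python
# Minimal sketch of isotonic calibration of classifier probabilities.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def expected_calibration_error(p, y, n_bins=10):
    # Binned gap between predicted probability and observed frequency.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

rng = np.random.default_rng(0)
# Toy held-out set: deliberately over-confident raw probabilities vs. true binary labels.
p_raw = rng.uniform(0, 1, 2000)
y = (rng.uniform(0, 1, 2000) < 0.5 + 0.3 * (p_raw - 0.5)).astype(float)

iso = IsotonicRegression(out_of_bounds="clip")
p_cal = iso.fit_transform(p_raw, y)        # monotone mapping from raw to calibrated probabilities

print("ECE before:", round(expected_calibration_error(p_raw, y), 3))
print("ECE after: ", round(expected_calibration_error(p_cal, y), 3))
```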

Conclusions : We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.

Odisho Anobel Y, Park Briton, Altieri Nicholas, DeNero John, Cooperberg Matthew R, Carroll Peter R, Yu Bin

2020-Oct

cancer, information extraction, machine learning, natural language processing, pathology, prostate cancer