General General

Ensembling graph attention networks for human microbe-drug association prediction.

In Bioinformatics (Oxford, England)

MOTIVATION : Human microbes are closely involved in a wide variety of complex human diseases and have become new drug targets. In silico methods for identifying potential microbe-drug associations provide an effective complement to conventional experimental methods; they can both aid the screening of candidate compounds for drug development and facilitate knowledge discovery about microbe-drug interaction mechanisms. Moreover, the recent growth in available biomedical data for microbes and drugs provides a great opportunity for machine learning approaches to predict microbe-drug associations. We are thus motivated to integrate these data sources to improve prediction accuracy. A further challenge is predicting interactions for new drugs or new microbes, which have no existing microbe-drug associations.

RESULTS : In this work, we leverage various sources of biomedical information and construct multiple networks (graphs) for microbes and drugs. Then, we develop a novel ensemble framework of graph attention networks with a hierarchical attention mechanism for microbe-drug association prediction from the constructed multiple microbe-drug graphs, denoted as EGATMDA. In particular, for each input graph, we design a graph convolutional network with node-level attention to learn embeddings for nodes (i.e. microbes and drugs). To effectively aggregate node embeddings from multiple input graphs, we implement graph-level attention to learn the importance of different input graphs. Experimental results under different cross-validation settings (e.g. the setting for predicting associations for new drugs) showed that our proposed method outperformed seven state-of-the-art methods. Case studies on predicted microbe-drug associations further demonstrated the effectiveness of our proposed EGATMDA method.
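The graph-level attention step described above can be illustrated with a minimal sketch. It assumes each input graph has already produced an embedding for the same node and that each graph has an unnormalized importance score; in EGATMDA these scores are learned, whereas here they are fixed numbers chosen for illustration. The function names are hypothetical and are not taken from the EGATMDA source code.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_graph_embeddings(embeddings, graph_scores):
    """Fuse one node's embeddings from several graphs.

    embeddings   : list with one vector per input graph (same length each)
    graph_scores : one unnormalized importance score per graph
                   (assumed given here; learned in the actual model)
    """
    weights = softmax(graph_scores)
    dim = len(embeddings[0])
    fused = [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]
    return weights, fused


# Two hypothetical graphs give two embeddings for the same drug node.
weights, fused = aggregate_graph_embeddings(
    embeddings=[[1.0, 0.0], [0.0, 1.0]],
    graph_scores=[0.0, 0.0],
)
```

With equal scores the fused vector is simply the mean of the per-graph embeddings; raising one score shifts the fused embedding toward that graph.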

AVAILABILITY : Source codes and supplementary materials are available at: https://github.com/longyahui/EGATMDA/.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Long Yahui, Wu Min, Liu Yong, Kwoh Chee Keong, Luo Jiawei, Li Xiaoli

2020-Dec-30

General General

Multi-objective optimization of peel and shear strengths in ultrasonic metal welding using machine learning-based response surface methodology.

In Mathematical biosciences and engineering : MBE

Ultrasonic metal welding (UMW) is a solid-state joining technique with varied industrial applications. Despite its numerous advantages, UMW has a relatively narrow operating window and is sensitive to variations in process conditions. As such, it is imperative to quantitatively characterize the influence of welding parameters on the resulting joint quality. The resulting quantitative model can subsequently be used to optimize the parameters. Conventional response surface methodology (RSM) usually employs linear or polynomial models, which may not be able to capture the intricate, nonlinear input-output relationships in UMW. Furthermore, some UMW applications call for simultaneous optimization of multiple quality indices such as peel strength, shear strength, electrical conductivity, and thermal conductivity. To address these challenges, this paper develops a machine learning (ML)-based RSM to model the input-output relationships in UMW and jointly optimize two quality indices, namely, peel and shear strengths. The performance of various ML methods, including spline regression, Gaussian process regression (GPR), support vector regression (SVR), and conventional polynomial regression models of different orders, is compared. A case study using experimental data shows that GPR with a radial basis function (RBF) kernel and SVR with an RBF kernel achieve the best prediction accuracy. The obtained response surface models are then used to optimize a compound joint strength indicator that is defined as the average of normalized shear and peel strengths. In addition, the case study reveals different patterns in the response surfaces of shear and peel strengths, which have not been systematically studied in the literature. While developed for the UMW application, the method can be extended to other manufacturing processes.
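The compound joint strength indicator can be sketched as follows. The abstract states only that it is the average of normalized shear and peel strengths; the min-max normalization, the setting labels, and the strength values below are assumptions made for illustration and are not data from the paper.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; assumed normalization, not confirmed by the paper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def compound_strength(peel, shear):
    """Average of normalized peel and shear strengths, one value per setting."""
    peel_n = min_max_normalize(peel)
    shear_n = min_max_normalize(shear)
    return [(p + s) / 2.0 for p, s in zip(peel_n, shear_n)]


def best_setting(settings, peel, shear):
    """Return the setting with the highest compound indicator and its score."""
    scores = compound_strength(peel, shear)
    best = max(range(len(scores)), key=scores.__getitem__)
    return settings[best], scores[best]


# Made-up predictions for three hypothetical parameter settings.
settings = ["A", "B", "C"]
peel = [10.0, 20.0, 30.0]
shear = [300.0, 200.0, 250.0]
choice, score = best_setting(settings, peel, shear)
```

In practice the peel and shear values would come from the fitted response surface models evaluated over a grid of welding parameters, so the same selection step would run over many more candidate settings.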

Meng Yuquan, Rajagopal Manjunath, Kuntumalla Gowtham, Toro Ricardo, Zhao Hanyang, Chang Ho Chan, Sundar Sreenath, Salapaka Srinivasa, Miljkovic Nenad, Ferreira Placid, Sinha Sanjiv, Shao Chenhui

2020-Oct-28

**machine learning, mechanical strength, process optimization, response surface methodology, ultrasonic metal welding**

General General

CLPred: a sequence-based protein crystallization predictor using BLSTM neural network.

In Bioinformatics (Oxford, England)

MOTIVATION : Determining the structures of proteins is a critical step in understanding their biological functions. The crystallography-based X-ray diffraction technique is the main method for experimental protein structure determination. However, the underlying crystallization process, which needs multiple time-consuming and costly experimental steps, has a high attrition rate. To overcome this issue, a series of in silico methods have been developed with the primary aim of selecting the protein sequences that are likely to crystallize. However, the predictive performance of the current methods is modest.

RESULTS : We propose a deep learning model, called CLPred, which uses a bidirectional recurrent neural network with long short-term memory (BLSTM) to capture the long-range interaction patterns between amino acid k-mers to predict protein crystallizability. Using sequence-only information, CLPred outperforms the existing deep-learning predictors and a vast majority of sequence-based diffraction-quality crystal predictors on three independent test sets. The results highlight the effectiveness of BLSTM in capturing non-local, long-range inter-peptide interaction patterns to distinguish proteins that can result in diffraction-quality crystals from those that cannot. CLPred improves steadily on previous window-based neural networks and predicts crystallization propensity with high accuracy. CLPred can also be improved significantly if it incorporates additional features from pre-extracted evolutionary, structural and physicochemical characteristics. The correctness of CLPred predictions is further validated by case studies of Sox transcription factor family member proteins and Zika virus non-structural proteins.
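As a rough illustration of the k-mer input to a sequence model such as a BLSTM, the sketch below splits a protein sequence into overlapping k-mers and maps them to integer ids. The value of k, the overlapping window, and the id scheme (0 reserved for padding) are assumptions; the abstract does not specify CLPred's exact preprocessing, and the function names are hypothetical.

```python
def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (stride 1, assumed)."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def encode(tokens, vocab):
    """Map k-mers to integer ids, growing the vocabulary as new k-mers appear.

    Ids start at 1 so that 0 can serve as a padding value.
    """
    return [vocab.setdefault(token, len(vocab) + 1) for token in tokens]


vocab = {}
tokens = kmers("MKTAYMKT", k=3)   # a made-up peptide fragment
ids = encode(tokens, vocab)       # integer sequence a recurrent model could consume
```

A recurrent model would then read the id sequence in both directions, which is what lets it relate k-mers that are far apart in the sequence.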

AVAILABILITY AND IMPLEMENTATION : https://github.com/xuanwenjing/CLPred.

Xuan Wenjing, Liu Ning, Huang Neng, Li Yaohang, Wang Jianxin

2020-Dec-30

General General

Supervised learning on phylogenetically distributed data.

In Bioinformatics (Oxford, England)

MOTIVATION : The ability to develop robust machine-learning (ML) models is considered imperative to the adoption of ML techniques in biology and medicine. This challenge is particularly acute when the data available for training are not independent and identically distributed (iid), in which case trained models are vulnerable to out-of-distribution generalization problems. Of particular interest are problems where data correspond to observations made on phylogenetically related samples (e.g. antibiotic resistance data).

RESULTS : We introduce DendroNet, a new approach to train neural networks in the context of evolutionary data. DendroNet explicitly accounts for the relatedness of the training/testing data, while allowing the model to evolve along the branches of the phylogenetic tree, hence accommodating potential changes in the rules that relate genotypes to phenotypes. Using simulated data, we demonstrate that DendroNet produces models that can be significantly better than non-phylogenetically aware approaches. DendroNet also outperforms other approaches at two biological tasks of significant practical importance: antibiotic resistance prediction in bacteria and trophic level prediction in fungi.
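One way to picture a model that "evolves along the branches" is to give each tree node its own weight vector, defined as its parent's weights plus a per-branch change, and to penalize large changes. The sketch below shows only that bookkeeping. The additive form and the squared penalty are assumptions made for illustration; the abstract does not state DendroNet's exact parameterization, and all names here are hypothetical.

```python
def node_weights(parent, root_weights, deltas):
    """Compute each node's weights by accumulating changes from the root.

    parent       : dict mapping node -> parent node (root maps to None)
    root_weights : weight vector at the root
    deltas       : dict mapping each non-root node -> change on its incoming branch
    """
    cache = {}

    def resolve(node):
        if node in cache:
            return cache[node]
        up = parent[node]
        if up is None:
            weights = list(root_weights)
        else:
            weights = [a + b for a, b in zip(resolve(up), deltas[node])]
        cache[node] = weights
        return weights

    return {node: resolve(node) for node in parent}


def delta_penalty(deltas):
    """Sum of squared branch changes; discourages abrupt shifts between relatives."""
    return sum(d * d for vec in deltas.values() for d in vec)


# A tiny hypothetical tree: root -> A -> B.
parent = {"root": None, "A": "root", "B": "A"}
deltas = {"A": [0.5, 0.0], "B": [0.0, -1.0]}
weights = node_weights(parent, [1.0, 0.0], deltas)
penalty = delta_penalty(deltas)
```

Under this assumed scheme, closely related samples share most of their weights, while distant lineages can drift apart at a cost set by the penalty term.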

AVAILABILITY AND IMPLEMENTATION : https://github.com/BlanchetteLab/DendroNet.

Layne Elliot, Dort Erika N, Hamelin Richard, Li Yue, Blanchette Mathieu

2020-Dec-30

General General

Matrix (factorization) reloaded: flexible methods for imputing genetic interactions with cross-species and side information.

In Bioinformatics (Oxford, England)

MOTIVATION : Mapping genetic interactions (GIs) can reveal important insights into cellular function and has potential translational applications. There has been great progress in developing high-throughput experimental systems for measuring GIs (e.g. with double knockouts) as well as in defining computational methods for inferring (imputing) unknown interactions. However, existing computational methods for imputation have largely been developed for and applied in baker's yeast, even as experimental systems have begun to allow measurements in other contexts. Importantly, existing methods face a number of limitations in requiring specific side information and with respect to computational cost. Further, few have addressed how GIs can be imputed when data are scarce.

RESULTS : In this article, we address these limitations by presenting a new imputation framework, called Extensible Matrix Factorization (EMF). EMF is a framework of composable models that flexibly exploit cross-species information in the form of GI data across multiple species, and arbitrary side information in the form of kernels (e.g. from protein-protein interaction networks). We perform a rigorous set of experiments on these models in matched GI datasets from baker's and fission yeast. These include the first such experiments on genome-scale GI datasets in multiple species in the same study. We find that EMF models that exploit side and cross-species information improve imputation, especially in data-scarce settings. Further, we show that EMF outperforms the state-of-the-art deep learning method, even when using strictly less data, and incurs orders of magnitude less computational cost.
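The basic idea behind imputing interactions by matrix factorization can be sketched with a plain low-rank model fitted to the observed entries only. This is a generic baseline under stated assumptions, not the EMF framework: it uses no cross-species data and no kernel side information, and the rank, learning rate, regularization, and toy scores are all illustrative choices.

```python
import random


def factorize(observed, n_rows, n_cols, rank=2, lr=0.02, epochs=500, reg=0.01, seed=0):
    """Fit U (n_rows x rank) and V (n_cols x rank) so that U[i].V[j] matches observed[(i, j)]."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = value - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Stochastic gradient step with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, i, j):
    """Imputed score for the pair (i, j)."""
    return sum(a * b for a, b in zip(U[i], V[j]))


def squared_error(observed, U, V):
    """Total squared error over the observed entries."""
    return sum((value - predict(U, V, i, j)) ** 2 for (i, j), value in observed.items())


# Made-up interaction scores for a 3 x 3 gene-pair matrix; entry (2, 2) is unmeasured.
observed = {
    (0, 0): 1.0, (0, 1): 0.5, (0, 2): -1.0,
    (1, 0): 2.0, (1, 1): 1.0, (1, 2): -2.0,
    (2, 0): 3.0, (2, 1): 1.5,
}
U, V = factorize(observed, n_rows=3, n_cols=3)
imputed = predict(U, V, 2, 2)  # estimate for the unmeasured pair
```

Side information of the kind the abstract mentions would enter as additional terms that tie the factors of related genes together, which this sketch omits.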

AVAILABILITY : Implementations of models and experiments are available at: https://github.com/lrgr/EMF.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Fan Jason, Li Xuan Cindy, Crovella Mark, Leiserson Mark D M

2020-Dec-30

General General

svMIL: predicting the pathogenic effect of TAD boundary-disrupting somatic structural variants through multiple instance learning.

In Bioinformatics (Oxford, England)

MOTIVATION : Despite the fact that structural variants (SVs) play an important role in cancer, methods to predict their effect, especially for SVs in non-coding regions, are lacking, leaving them often overlooked in the clinic. Non-coding SVs may disrupt the boundaries of Topologically Associated Domains (TADs), thereby affecting interactions between genes and regulatory elements such as enhancers. However, it is not known when such alterations are pathogenic. Although machine learning techniques are a promising solution to answer this question, representing the large number of interactions that an SV can disrupt in a single feature matrix is not trivial.

RESULTS : We introduce svMIL: a method to predict pathogenic TAD boundary-disrupting SV effects based on multiple instance learning, which circumvents the need for a traditional feature matrix by grouping SVs into bags that can contain any number of disruptions. We demonstrate that svMIL can predict SV pathogenicity, measured through same-sample gene expression aberration, for various cancer types. In addition, our approach reveals that somatic pathogenic SVs alter different regulatory interactions than somatic non-pathogenic SVs and germline SVs.
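The bag structure used in multiple instance learning can be illustrated with a short sketch: each SV is a bag holding any number of instance vectors (for example, one per disrupted interaction), and a pooling step turns every bag into a fixed-length vector that an ordinary classifier could consume. The max-pooling choice and the feature values are assumptions for illustration; the abstract does not describe how svMIL itself summarizes bags.

```python
def bag_features(instances):
    """Max-pool a bag of instance vectors into one fixed-length vector."""
    if not instances:
        raise ValueError("a bag must contain at least one instance")
    dim = len(instances[0])
    return [max(inst[d] for inst in instances) for d in range(dim)]


# Two hypothetical SV bags with different numbers of disrupted interactions.
bags = {
    "sv_1": [[0.2, 0.9]],
    "sv_2": [[0.1, 0.3], [0.8, 0.0], [0.4, 0.5]],
}
pooled = {name: bag_features(instances) for name, instances in bags.items()}
```

Both bags end up with vectors of the same length regardless of how many disruptions they contain, which is what removes the need for a single traditional feature matrix row per interaction.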

AVAILABILITY AND IMPLEMENTATION : All code for svMIL is publicly available on GitHub: https://github.com/UMCUGenetics/svMIL.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Nieboer Marleen M, de Ridder Jeroen

2020-Dec-30