
General

Collaborative deep learning improves disease-related circRNA prediction based on multi-source functional information.

In Briefings in bioinformatics

Emerging studies have shown that circular RNAs (circRNAs) are involved in a variety of biological processes and play a key role in disease diagnosis, treatment and inference. Although many methods, including traditional machine learning and deep learning, have been developed to predict associations between circRNAs and diseases, the biological function of circRNAs has not been fully exploited. Some methods have explored disease-related circRNAs from different views, but how to use multi-view circRNA data efficiently is still not well studied. Therefore, we propose a computational model, CLCDA, to predict potential circRNA-disease associations based on collaborative learning with circRNA multi-view functional annotations. First, we extract circRNA multi-view functional annotations and build a circRNA association network for each view to enable effective network fusion. Then, a collaborative deep learning framework for multi-view information is designed to obtain circRNA multi-source information features, making full use of the internal relationships among the circRNA views. We build a network consisting of circRNAs and diseases from their functional similarity and extract the consistency description information of circRNAs and diseases. Last, we predict potential associations between circRNAs and diseases using a graph autoencoder. Our computational model performs better than existing ones in predicting candidate disease-related circRNAs. Furthermore, case studies on several common diseases, in which the method finds previously unknown related circRNAs, demonstrate its practicability. The experiments show that CLCDA can efficiently predict disease-related circRNAs and is helpful for the diagnosis and treatment of human disease.

Wang Yongtian, Liu Xinmeng, Shen Yewei, Song Xuerui, Wang Tao, Shang Xuequn, Peng Jiajie

2023-Feb-27

circRNA, collaborative deep learning, disease, multi-view functional annotation
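
The first step in the abstract above (one association network per annotation view, followed by fusion) can be illustrated with a deliberately simple sketch. It is not the CLCDA method: cosine similarity and element-wise averaging are stand-in assumptions for the paper's collaborative deep learning framework, and the function names and toy annotation vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_network(annotations):
    """Build an n x n similarity matrix from one view's annotation vectors."""
    n = len(annotations)
    return [[cosine_similarity(annotations[i], annotations[j]) for j in range(n)]
            for i in range(n)]

def fuse_networks(networks):
    """Fuse same-sized similarity matrices by element-wise averaging (assumed rule)."""
    n = len(networks[0])
    return [[sum(net[i][j] for net in networks) / len(networks) for j in range(n)]
            for i in range(n)]

# Hypothetical toy data: 3 circRNAs, two views of binary annotation-term membership.
view_a = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
view_b = [[1, 0], [1, 0], [0, 1]]

fused = fuse_networks([similarity_network(view_a), similarity_network(view_b)])
```

In this toy example the first two circRNAs share annotations in both views and so keep a high fused similarity, while the third shares none and stays at zero. A learned fusion could weight views unequally, which plain averaging cannot do.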

General

Machine Learning Models for Predicting Molecular UV-Vis Spectra with Quantum Mechanical Properties.

In Journal of chemical information and modeling

Accurate understanding of ultraviolet-visible (UV-vis) spectra is critical for the high-throughput synthesis of compounds for drug discovery. Experimentally determining UV-vis spectra can become expensive when dealing with a large quantity of novel compounds. This provides us an opportunity to drive computational advances in molecular property predictions using quantum mechanics and machine learning methods. In this work, we use both quantum mechanically (QM) predicted and experimentally measured UV-vis spectra as input to devise four different machine learning architectures, UVvis-SchNet, UVvis-DTNN, UVvis-Transformer, and UVvis-MPNN, and assess the performance of each method. We find that the UVvis-MPNN model outperforms the other models when using optimized 3D coordinates and QM predicted spectra as input features. This model has the highest performance for predicting UV-vis spectra with a training RMSE of 0.06 and validation RMSE of 0.08. Most importantly, our model can be used for the challenging task of predicting differences in the UV-vis spectral signatures of regioisomers.

McNaughton Andrew D, Joshi Rajendra P, Knutson Carter R, Fnu Anubhav, Luebke Kevin J, Malerich Jeremiah P, Madrid Peter B, Kumar Neeraj

2023-Feb-27
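
The RMSE figures quoted in the abstract above compare predicted and reference spectra. A minimal sketch of that metric follows, assuming both spectra are sampled on the same wavelength grid. The abstract does not state how the spectra were normalized or binned, so the values below are hypothetical.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two spectra sampled on the same grid."""
    if len(predicted) != len(reference):
        raise ValueError("spectra must have the same number of points")
    if not predicted:
        raise ValueError("spectra must not be empty")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

# Hypothetical normalized absorbance values at four shared wavelengths.
predicted = [0.10, 0.50, 0.90, 0.40]
reference = [0.10, 0.40, 1.00, 0.40]

error = rmse(predicted, reference)
```

Because RMSE squares the residuals before averaging, a few large deviations (for example a misplaced absorption peak) weigh more heavily than many small ones.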

Public Health

A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases.

In mSystems

Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that are associated with the swab site locations. However, the swab site location information is currently collected in a single, free-text "isolation source" field, which promotes the generation of poorly detailed descriptions with varying word order, granularity, and linguistic errors, making automation difficult and reducing machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of free-text metadata was evaluated to determine the informational facets and the quantity of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies that are connected with logical relationships to describe swab site locations. Five informational facets that were described by 338 unique terms were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package became available at NCBI BioSample in 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence as well as big-data solutions to food safety.

IMPORTANCE The regular analysis of whole-genome sequence data in collections such as NCBI's Pathogen Detection Database is used by many public health organizations to detect outbreaks of infectious disease. However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted for use in aggregate analyses. These processes are inefficient and time-consuming, increasing the interpretative labor needed by public health groups to extract actionable information. The future use of open genomic epidemiology networks will be supported through the development of an internationally applicable vocabulary system with which swab site locations can be described.

Feng Jingzhang, Daeschel Devin, Dooley Damion, Griffiths Emma, Allard Marc, Timme Ruth, Chen Yi, Snyder Abigail B

2023-Feb-27

epidemiology, foodborne pathogen, genomic surveillance, informatics
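
The lexicon evaluation mentioned in the abstract above (counting the unique terms that data collectors use) can be sketched as follows. This is an illustrative assumption about how such a count might be made, not the authors' procedure. The tokenization rule and the example descriptions are hypothetical.

```python
import re
from collections import Counter

def tokenize(description):
    """Lowercase a free-text description and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", description.lower())

def term_counts(descriptions):
    """Count how often each term occurs across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    return counts

# Hypothetical "isolation source" entries; the first two describe the same site
# with different word order, punctuation, and capitalization.
descriptions = [
    "Floor drain, raw side",
    "drain - floor (RAW side)",
    "Conveyor belt; packing room",
]

counts = term_counts(descriptions)
unique_terms = sorted(counts)
```

The first two entries reduce to the same set of terms once case and punctuation are ignored, which shows why free text needs normalizing before aggregate analysis. A controlled vocabulary goes further by fixing the terms and their relationships in advance.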

Pathology

Machine Learning Prediction and Phyloanatomic Modeling of Viral Neuroadaptive Signatures in the Macaque Model of HIV-Mediated Neuropathology.

In Microbiology spectrum

In human immunodeficiency virus (HIV) infection, virus replication in and adaptation to the central nervous system (CNS) can result in neurocognitive deficits in approximately 25% of patients with unsuppressed viremia. While no single viral mutation can be agreed upon as distinguishing the neuroadapted population, earlier studies have demonstrated that a machine learning (ML) approach could be applied to identify a collection of mutational signatures within the virus envelope glycoprotein (gp120) predictive of disease. The simian immunodeficiency virus (SIV)-infected macaque is a widely used animal model of HIV neuropathology, allowing in-depth tissue sampling infeasible for human patients. Yet, the translational impact of the ML approach within the context of the macaque model has not been tested, much less the capacity for early prediction in other, noninvasive tissues. Applying the previously described ML approach to gp120 sequences obtained from the CNS of animals with and without SIV-mediated encephalitis (SIVE), we predicted SIVE with 97% accuracy. The presence of SIVE signatures at earlier time points of infection in non-CNS tissues indicated these signatures cannot be used in a clinical setting; however, combined with protein structural mapping and statistical phylogenetic inference, results revealed common denominators associated with these signatures, including 2-acetamido-2-deoxy-beta-d-glucopyranose structural interactions and a high rate of alveolar macrophage (AM) infection. AMs were also determined to be the phyloanatomic source of cranial virus in SIVE animals, but not in animals that did not develop SIVE, implicating a role for these cells in the evolution of the signatures identified as predictive of both HIV and SIV neuropathology.

IMPORTANCE HIV-associated neurocognitive disorders remain prevalent among persons living with HIV (PLWH) owing to our limited understanding of the contributing viral mechanisms and limited ability to predict disease onset. We have expanded on a machine learning method previously used on HIV genetic sequence data to predict neurocognitive impairment in PLWH to the more extensively sampled SIV-infected macaque model in order to (i) determine the translatability of the animal model and (ii) more accurately characterize the predictive capacity of the method. We identified eight amino acid and/or biochemical signatures in the SIV envelope glycoprotein, the most predominant of which demonstrated the potential for aminoglycan interaction characteristic of previously identified HIV signatures. These signatures were not isolated to specific points in time or to the central nervous system, limiting their use as an accurate clinical predictor of neuropathogenesis; however, statistical phylogenetic and signature pattern analyses implicate the lungs as a key player in the emergence of neuroadapted viruses.

Ramirez-Mata Andrea S, Ostrov David, Salemi Marco, Marini Simone, Magalis Brittany Rife

2023-Feb-27

HIV, SIV, envelope, machine learning, neuroAIDS, neuroadaptation, neuropathology, phyloanatomy, phylogenetic

General

Estimating resistance surfaces using gradient forest and allelic frequencies.

In Molecular ecology resources

Understanding landscape connectivity has become a global priority for mitigating the impact of landscape fragmentation on biodiversity. Link-based connectivity methods traditionally rely on relating pairwise genetic distance between individuals or demes to their landscape distance (e.g., geographic distance, cost distance). In this study, we present an alternative to conventional statistical approaches to refine cost surfaces by adapting the gradient forest approach to produce a resistance surface. Used in community ecology, gradient forest is an extension of random forest, and has been implemented in genomic studies to model species genetic offset under future climatic scenarios. By design, this adapted method, resistance Gradient Forest (resGF), can handle multiple environmental predictors and is not subject to the traditional assumptions of linear models such as independence, normality and linearity. Using genetic simulations, resGF performance was compared to other published methods (maximum likelihood population effects (MLPE) model, random forest-based least-cost transect analysis and species distribution model). In univariate scenarios, resGF was able to distinguish the true surface contributing to genetic diversity among competing surfaces better than the compared methods. In multivariate scenarios, the gradient forest approach performed similarly to the other random forest-based approach using least-cost transect analysis but outperformed MLPE-based methods. Additionally, two worked examples are provided using two previously published datasets. This machine learning algorithm has the potential to improve our understanding of landscape connectivity and inform long-term biodiversity conservation strategies.

Vanhove Mathieu, Launey Sophie

2023-Feb-27

functional connectivity, gradient forest, isolation by resistance, landscape genetics, machine learning, resistance surface
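
The "cost distance" mentioned in the abstract above is the accumulated resistance along the cheapest path between two locations on a resistance surface. The sketch below computes it with Dijkstra's algorithm on a small grid. It illustrates the landscape-distance concept only and is not the resGF method. The 4-neighbour movement rule, the mean-of-two-cells step cost, and the toy surface are assumptions.

```python
import heapq

def least_cost_distance(resistance, start, goal):
    """Accumulated cost of the cheapest 4-neighbour path between two grid cells.

    Moving between adjacent cells costs the mean of their resistance values
    (an assumed convention). Returns infinity if the goal is unreachable.
    """
    rows, cols = len(resistance), len(resistance[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return cost
        if cost > best.get((r, c), float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = (resistance[r][c] + resistance[nr][nc]) / 2
                new_cost = cost + step
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    return float("inf")

# Hypothetical resistance surface: a high-resistance cell in the centre.
surface = [
    [1, 1, 1],
    [1, 9, 1],
    [1, 1, 1],
]

cost = least_cost_distance(surface, (1, 0), (1, 2))
```

Here the cheapest route detours around the central cell, so the cost distance is larger than on a uniform surface even though the straight-line distance is unchanged. Link-based methods relate this kind of landscape distance to pairwise genetic distance.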

General

THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior.

In eLife

Understanding object representations requires a broad, comprehensive sampling of the objects in our visual world with dense measurements of brain activity and behavior. Here we present THINGS-data, a multimodal collection of large-scale neuroimaging and behavioral datasets in humans, comprising densely-sampled functional MRI and magnetoencephalographic recordings, as well as 4.70 million similarity judgments in response to thousands of photographic images for up to 1,854 object concepts. THINGS-data is unique in its breadth of richly-annotated objects, allowing for testing countless hypotheses at scale while assessing the reproducibility of previous findings. Beyond the unique insights promised by each individual dataset, the multimodality of THINGS-data allows combining datasets for a much broader view into object processing than previously possible. Our analyses demonstrate the high quality of the datasets and provide five examples of hypothesis-driven and data-driven applications. THINGS-data constitutes the core public release of the THINGS initiative (https://things-initiative.org) for bridging the gap between disciplines and the advancement of cognitive neuroscience.

Hebart Martin N, Contier Oliver, Teichmann Lina, Rockter Adam H, Zheng Charles Y, Kidder Alexis, Corriveau Anna, Vaziri-Pashkam Maryam, Baker Chris I

2023-Feb-27

human, neuroscience