bioRxiv Preprint

GUIdEStaR integrates existing databases of important compositional and structural elements of sequences: various types of G-quadruplex, upstream open reading frames (uORFs), internal ribosome entry sites (IRESs), epigenetic modifications (histone protein and RNA), and repeats. It contains binary information (presence/absence of each element) organized into five regions (5'UTR, 3'UTR, exon, intron, and biological region) per transcript and per gene. These elements are highly interdependent in controlling the functional interactions of a gene. The database covers approximately 40,000 genes and 320,000 transcripts, with 845 presence/absence attributes per transcript. Recently, artificial intelligence (AI)-based analysis of sequencing data has been gaining popularity in bioinformatics. To help create datasets that can be used as input to AI methods, GUIdEStaR comes with example Java code. Here, we demonstrate the database usage with three neural network classification examples: 1) a small RNA example identifying the attributes unique to transcription factor (TF) genes mediated by small RNAs originating from SARS-CoV-2 vs. from human, 2) a cell membrane receptor study classifying virus-interacting vs. non-interacting receptors, and 3) a study classifying receptors targeted by nonsense-mediated mRNA decay (NMD) vs. non-target receptors. GUIdEStaR is freely available at www.guidestar.kr and https://sourceforge.net/projects/guidestar.
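
As a rough illustration of how per-transcript presence/absence data could be turned into neural network input, the following minimal Java sketch parses rows of 0/1 flags into a numeric feature matrix. It is not the example code distributed with GUIdEStaR: the tab-separated row layout (transcript ID followed by flags), the class name `PresenceAbsenceSketch`, and the method names are all hypothetical, and the rows use 3 flags for brevity where a real transcript would carry 845.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of the GUIdEStaR distribution.
class PresenceAbsenceSketch {

    /**
     * Parses one row of an ASSUMED tab-separated layout:
     * transcript ID, then expectedFlags values that are each "0" or "1".
     */
    static int[] parseFlags(String row, int expectedFlags) {
        String[] fields = row.split("\t");
        if (fields.length != expectedFlags + 1) {
            throw new IllegalArgumentException(
                "Expected " + expectedFlags + " flags, got " + (fields.length - 1));
        }
        int[] flags = new int[expectedFlags];
        for (int i = 0; i < expectedFlags; i++) {
            String value = fields[i + 1].trim();
            if (!value.equals("0") && !value.equals("1")) {
                throw new IllegalArgumentException("Not a binary flag: " + value);
            }
            flags[i] = value.equals("1") ? 1 : 0;
        }
        return flags;
    }

    /** Stacks the parsed rows into a transcripts-by-features matrix of doubles. */
    static double[][] toFeatureMatrix(List<String> rows, int expectedFlags) {
        double[][] matrix = new double[rows.size()][expectedFlags];
        for (int r = 0; r < rows.size(); r++) {
            int[] flags = parseFlags(rows.get(r), expectedFlags);
            for (int c = 0; c < expectedFlags; c++) {
                matrix[r][c] = flags[c];
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only; IDs and values are not real data.
        List<String> rows = Arrays.asList("TX_A\t1\t0\t1", "TX_B\t0\t0\t1");
        double[][] x = toFeatureMatrix(rows, 3);
        System.out.println(x.length + " transcripts x " + x[0].length + " features");
    }
}
```

Each matrix row can then be paired with a class label (for example, virus-interacting vs. non-interacting receptor) before being passed to a classifier of choice.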

Kang, J. E.

2021-04-04