bioRxiv Preprint

GUIdEStaR integrates existing databases of important compositional and structural elements of sequences: various types of G-quadruplex, upstream open reading frame (uORF), internal ribosome entry site (IRES), epigenetic modification (histone protein and RNA), and repeats. It contains binary information (presence/absence of each element) that is organized into 5 regions (5'UTR, 3'UTR, exon, intron, and biological region) per transcript and per gene. These elements are highly interdependent in controlling the functional interactions of a gene. The database covers approximately 40,000 genes and 320,000 transcripts, with 845 presence/absence attributes per transcript. Recently, artificial intelligence (AI) based analysis of sequencing data has been gaining popularity in bioinformatics. To help create datasets that can be used as input to AI methods, GUIdEStaR comes with example Java code. Here, we demonstrate the database usage with three neural network classification examples: 1) a small RNA example for identifying the attributes that are unique to transcription factor (TF) genes mediated by small RNAs originating from SARS-CoV-2 vs. from human, 2) a cell membrane receptor study for classifying virus-interacting vs. non-interacting receptors, and 3) a classification of receptors targeted by nonsense-mediated mRNA decay (NMD) vs. non-target receptors. GUIdEStaR is freely available at and
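To illustrate how region-wise presence/absence information of this kind could be turned into input for a classifier, the following is a minimal Java sketch. It is not the example code distributed with GUIdEStaR: the class name `PresenceAbsenceEncoder`, the region labels, the element names (`G4`, `uORF`, `IRES`), the `region:element` naming scheme, and the CSV layout are all assumptions made for illustration, and the real database's file format and attribute names may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names and layout are hypothetical, not GUIdEStaR's own.
final class PresenceAbsenceEncoder {

    // The five regions named in the abstract (labels here are assumed spellings).
    static final List<String> REGIONS =
            List.of("5UTR", "3UTR", "exon", "intron", "biological_region");

    // Builds an ordered list of attribute names of the form "region:element".
    static List<String> attributeNames(List<String> elements) {
        List<String> names = new ArrayList<>();
        for (String region : REGIONS) {
            for (String element : elements) {
                names.add(region + ":" + element);
            }
        }
        return names;
    }

    // Encodes one transcript as a 0/1 vector: 1 if the attribute is present, else 0.
    static int[] encode(List<String> attributeNames, Set<String> present) {
        int[] features = new int[attributeNames.size()];
        for (int i = 0; i < features.length; i++) {
            features[i] = present.contains(attributeNames.get(i)) ? 1 : 0;
        }
        return features;
    }

    // Formats one labeled example as a CSV row: id, features..., class label.
    static String toCsvRow(String transcriptId, int[] features, String label) {
        StringBuilder row = new StringBuilder(transcriptId);
        for (int f : features) {
            row.append(',').append(f);
        }
        return row.append(',').append(label).toString();
    }

    public static void main(String[] args) {
        // Toy element set; the actual database has 845 attributes per transcript.
        List<String> names = attributeNames(List.of("G4", "uORF", "IRES"));
        // Hypothetical transcript with a uORF in the 5'UTR and a G-quadruplex in an intron.
        int[] features = encode(names, Set.of("5UTR:uORF", "intron:G4"));
        System.out.println(toCsvRow("transcript_1", features, "TF"));
    }
}
```

Each row produced this way is a fixed-length binary vector plus a class label, which is the shape most neural network classifiers expect as training input.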

Kang, J. E.