In Genomics, proteomics & bioinformatics
The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation in different biological contexts. To this end, dedicated high-throughput experimental technologies have been developed to capture TSSs genome-wide, and various computational tools have been developed to predict TSSs in silico from genomic sequence alone. Most of these computational tools cast the problem as a binary classification task on a balanced dataset, which results in large numbers of false positive predictions when the tools are applied at genome scale. Here, we present DeeReCT-TSS, a deep learning-based method that is capable of identifying TSSs across the whole genome based on both DNA sequence and conventional RNA sequencing data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms solely sequence-based methods in the precise annotation of TSSs used in different cell types. Furthermore, we have developed a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets by correlating our predicted TSSs with experimentally defined TSS chromatin states. The source code for DeeReCT-TSS is available at: https://github.com/JoshuaChou2018/DeeReCT-TSS_release and https://ngdc.cncb.ac.cn/biocode/tools/BT007316.
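The abstract's point about balanced training sets can be illustrated with a short calculation. The sketch below is not part of DeeReCT-TSS; the function name and every number in it (95% sensitivity, 95% specificity, one true TSS per 10,000 candidate positions) are hypothetical values chosen only to show how precision falls when a classifier evaluated on balanced data is applied to a genome where true TSSs are rare.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision of a binary classifier at a given positive-class prevalence.

    All inputs are fractions in [0, 1]. This is the standard
    positive-predictive-value formula, not code from DeeReCT-TSS.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 95% sensitivity and 95% specificity.
SENSITIVITY = 0.95
SPECIFICITY = 0.95

# Balanced benchmark: half of the examples are true TSSs.
balanced_precision = precision_at_prevalence(SENSITIVITY, SPECIFICITY, 0.5)

# Genome-scale scan: assume (hypothetically) 1 true TSS per 10,000 positions.
genome_precision = precision_at_prevalence(SENSITIVITY, SPECIFICITY, 1e-4)

print(f"precision on balanced data: {balanced_precision:.3f}")
print(f"precision at genome scale:  {genome_precision:.4f}")
```

Under these assumed numbers the same classifier is 95% precise on the balanced benchmark but well under 1% precise in the genome-wide scan, because false positives from the overwhelming majority of non-TSS positions swamp the true positives.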
Zhou Juexiao, Zhang Bin, Li Haoyang, Zhou Longxi, Li Zhongxiao, Long Yongkang, Han Wenkai, Wang Mengran, Cui Huanhuan, Li Jingjing, Chen Wei, Gao Xin
2022-Dec-14
Deep learning, Machine learning, Meta-learning, RNA sequencing, Transcription start sites