In IEEE/ACM transactions on computational biology and bioinformatics
Non-coding RNAs (ncRNAs) play an important role in various biological processes and are associated with diseases. Distinguishing between coding RNAs and ncRNAs, also known as predicting coding potential of RNA sequences, is critical for downstream biological function analysis. Many machine learning-based methods have been proposed for predicting coding potential of RNA sequences. Recent studies reveal that most existing methods have poor performance on RNA sequences with short Open Reading Frames (sORF, ORF length<303nt). In this work, we analyze the distribution of ORF length of RNA sequences, and observe that the number of coding RNAs with sORF is inadequate and coding RNAs with sORF are much less than ncRNAs with sORF. Thus, there exists the problem of local data imbalance in RNA sequences with sORF. We propose a coding potential prediction method CPE-SLDI, which uses data oversampling techniques to augment samples for coding RNAs with sORF so as to alleviate local data imbalance. Compared with existing methods, CPE-SLDI produces the better performances, and studies reveal that the data augmentation by various data oversampling techniques can enhance the performance of coding potential prediction, especially for RNA sequences with sORF. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPESLDI.
Chen Xian-Gan, Liu Shuai, Zhang Wen