DNA N6-methyladenine (6 mA) is an epigenetic modification that plays a vital role in a variety of cellular processes in both eukaryotes and prokaryotes. Accurate information of 6 mA sites in the Rosaceae genome may assist in understanding genomic 6 mA distributions and various biological functions such as epigenetic inheritance. Various studies have shown the possibility of identifying 6 mA sites through experiments, but the procedures are time-consuming and costly. To overcome the drawbacks of experimental methods, we propose an accurate computational paradigm based on a machine learning (ML) technique to identify 6 mA sites in Rosa chinensis (R.chinensis) and Fragaria vesca (F.vesca). To improve the performance of the proposed model and to avoid overfitting, a recursive feature elimination with cross-validation (RFECV) strategy is used to extract the optimal number of features (ONF) subset from five different DNA sequence encoding schemes, i.e., Binary Encoding (BE), Ring-Function-Hydrogen-Chemical Properties (RFHC), Electron-Ion-Interaction Pseudo Potentials of Nucleotides (EIIP), Dinucleotide Physicochemical Properties (DPCP), and Trinucleotide Physicochemical Properties (TPCP). Subsequently, we use the ONF subset to train a double layers of ML-based stacking model to create a bioinformatics tool named 'i6mA-stack'. This tool outperforms its peer tool in general and is currently available at http://nsclbio.jbnu.ac.kr/tools/i6mA-stack/.
Khanal Jhabindra, Lim Dae Young, Tayara Hilal
DNA N6-methyladenine, Machine learning, RFECV, Sequence analysis, Stacking