In Journal of Proteome Research
Predicting DNA-protein binding (DPB) helps to elucidate the mechanisms that regulate gene expression, yet it remains a challenging task in computational biology. Traditional DPB prediction methods depend on manually extracted features, which may lead to classification errors. Recently, deep learning methods such as convolutional neural networks (CNNs) have been successfully applied to classification tasks and have significantly improved DPB prediction performance. However, these methods model only the original DNA sequence and ignore the hidden complex dependencies and complementarity among multiple sequence features. To address this problem, we propose a method that fuses different sequence features and analyzes them systematically with a multi-scale CNN. First, sliding windows of specified lengths are applied to distinct DNA sequences to generate multiple sequence features of unequal lengths. Second, the multiple feature sequences are fused and encoded for feature representation. Third, a multi-scale CNN with different binding motif lengths automatically learns the influence of internal attributes and the hidden complex relations among the fused sequence features, and exploits the complementary advantages of the extracted CNN features to predict DPB. When applied to 690 ChIP-seq datasets, our model achieves an average AUC of 0.9112, which is significantly better than the latest methods. The results show that our method is effective for DPB prediction. It is freely available at http://220.127.116.11/mscDPB/.
Du Xiuquan, Hu Jiajia, Li Shuo
DNA-protein binding, feature fusion, multi-scale complementary feature
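
The windowing and multi-scale convolution steps described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the window lengths, the fusion scheme, or the network architecture, so everything here is an assumption made for illustration: stride-1 windows, one-hot encoding of the raw sequence in place of the fused features, one hand-written filter per scale instead of learned filters, and the helper names `sliding_windows`, `one_hot`, `conv1d_max`, `multi_scale_features`, and `motif_kernel`.

```python
ALPHABET = "ACGT"


def sliding_windows(seq, k, stride=1):
    """Return every length-k substring of seq, taken every `stride` positions."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def one_hot(seq):
    """Encode a DNA string as rows of 4 values in the order A, C, G, T."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def conv1d_max(encoded, kernel):
    """Slide one filter over the sequence, apply ReLU, then global max pooling."""
    width = len(kernel)
    best = 0.0  # ReLU floor: negative activations are clipped to zero
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][c] * kernel[i][c]
            for i in range(width)
            for c in range(4)
        )
        best = max(best, score)
    return best


def multi_scale_features(seq, kernels_by_width):
    """Concatenate the pooled responses of filters with several widths."""
    encoded = one_hot(seq)
    features = []
    for width in sorted(kernels_by_width):
        for kernel in kernels_by_width[width]:
            features.append(conv1d_max(encoded, kernel))
    return features


def motif_kernel(motif):
    """Hypothetical fixed filter that scores +1 per base matching `motif`.

    In a trained CNN these weights would be learned, not written by hand.
    """
    return one_hot(motif)


seq = "ACGTGACGTTAGC"

# Step 1 (assumed form): windows of different lengths give sequences of
# different lengths, here 11 windows for k=3 and 9 windows for k=5.
windows = {k: sliding_windows(seq, k) for k in (3, 5)}

# Step 3 (assumed form): filters of width 3 and 5 stand in for binding
# motifs of different lengths; their pooled outputs are concatenated.
kernels = {3: [motif_kernel("GAC")], 5: [motif_kernel("CGTTA")]}
features = multi_scale_features(seq, kernels)
print(features)  # [3.0, 5.0] because both motifs occur exactly in seq
```

Global max pooling gives each filter a single output whatever the input length. This lets responses from filters of different widths be concatenated into one feature vector for a downstream classifier.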