In Computational biology and chemistry ; h5-index 0.0
At present, tertiary structure discovery growth rate is lagging far behind discovery of primary structure. The prediction of protein structural class using Machine Learning techniques can help reduce this gap. The Structural Classification of Protein - Extended (SCOPe 2.07) is latest and largest dataset available at present. The protein sequences with less than 40% identity to each other are used for predicting α, β, α/β and α + β SCOPe classes. The sensitive features are extracted from primary and secondary structure representations of Proteins. Features are extracted experimentally from secondary structure with respect to its frequency, pitch and spatial arrangements. Primary structure based features contain species information for a protein sequence. The species parameters are further validated with uniref100 dataset using TaxId. As it is known, protein tertiary structure is manifestation of function. Functional differences are observed in species. Hence, the species are expected to have strong correlations with structural class, which is discovered in current work. It enhances prediction accuracy by 7%-10%. The subset of SCOPe 2.07 is trained using 65 dimensional feature vector using Random Forest classifier. The test result for the rest of the set gives consistent accuracy of better than 95%. The accuracy achieved on benchmark datasets ASTRAL 1.73, 25PDB and FC699 is better than 86%, 91% and 97% respectively, which is best reported to our knowledge.
Apurva Mehta, Mazumdar Himanshu
Evolutionary, Low identity, Protein secondary structure, Protein structural class, Random Forest, Structural classification of proteins (SCOP)