Intrinsic molecular subtypes provide an important biological classification of breast cancer (BC), but the subtype assigned to an individual sample is influenced by assay technology and by the composition of the study cohort. We sought to develop a platform-independent, absolute, single-sample subtype classifier based on a minimal number of genes. Pairwise ratios of subtype-specific differentially expressed genes, computed from un-normalized expression data of 432 BC samples from The Cancer Genome Atlas (TCGA), were used as inputs for machine learning. The classifier with the fewest genes and maximal classification power was selected during cross-validation. The final model was evaluated on 5816 samples from 10 independent studies profiled on four different assay platforms. In cross-validation within the TCGA cohort, a random forest classifier with 11 genes (MiniABS) achieved the best accuracy, 88.2%. Applied to five validation sets of RNA-seq and microarray data, MiniABS showed an average accuracy of 85.15% (vs. 77.72% for Absolute Intrinsic Molecular Subtype (AIMS)). On five low-throughput datasets, to which only MiniABS could be applied, the average accuracy was 87.93%. MiniABS can assign absolute BC subtypes from the raw expression levels of only 11 genes, regardless of assay platform, with higher accuracy than existing methods.
Seo Mi-Kyoung, Paik Soonmyung, Kim Sangwoo
breast cancer, classifier, machine learning, optimization, subtyping
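
The abstract's use of pairwise gene ratios from un-normalized data can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pairwise_log_ratios`, the gene labels, the expression values, and the log2 transform are all assumptions made for illustration. The sketch only shows why ratio features may suit a platform-independent classifier: multiplying every expression value in a sample by a constant, as a crude stand-in for a platform-specific scale, leaves each ratio unchanged. With 11 genes there are at most 55 such unordered pairs.

```python
from itertools import combinations
from math import isclose, log2


def pairwise_log_ratios(expr):
    """Return {(gene_a, gene_b): log2(a / b)} for every unordered gene pair.

    `expr` maps gene name -> raw (un-normalized) expression value.
    Values must be strictly positive; handling zeros (e.g. with a
    pseudocount) is left out because it would break exact scale invariance.
    """
    genes = sorted(expr)
    return {(a, b): log2(expr[a] / expr[b]) for a, b in combinations(genes, 2)}


# Hypothetical raw expression values for one sample (illustrative only).
sample = {"GENE_A": 120.0, "GENE_B": 30.0, "GENE_C": 7.5}

# The same sample as it might appear on a platform with a 10x larger scale.
rescaled = {gene: 10.0 * value for gene, value in sample.items()}

features = pairwise_log_ratios(sample)
features_rescaled = pairwise_log_ratios(rescaled)

# Ratio features are identical despite the change of scale.
for pair in features:
    assert isclose(features[pair], features_rescaled[pair])
    print(pair, round(features[pair], 3))
```

Features of this kind could then be passed to any standard classifier, such as a random forest. Real platforms differ by more than a single multiplicative factor, so this invariance is a motivation for the approach rather than a guarantee of cross-platform accuracy.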