In BMC bioinformatics
BACKGROUND : Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible-therefore, automated unsupervised learning approaches are utilised for discovering tissue heterogeneity. However, automated analyses require experience setting the algorithms' hyperparameters and expert knowledge about the analysed biological processes. Moreover, feature engineering is needed to obtain valuable results because of the numerous features measured.
RESULTS : We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices: Dice Index, Rand Index and EXIMS score, focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. Algorithms were validated on mass spectrometry imaging (MSI) datasets-2D human cancer tissue samples and 3D mouse kidney images. DiviK algorithm performed the best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering being mostly second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial [Formula: see text] effect size: 0.141 versus 0.345, Kendall's concordance index: 0.424 versus 0.138 for d(0, 0, 0)).
CONCLUSIONS : DiviK could be the default choice in the exploration of MSI data. Thanks to its unique, GMM-based local optimisation of the feature space and deglomerative schema, DiviK results do not strongly depend on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK shows high scalability, and it can process at once the big omics data with more than 1.5 mln instances and a few thousand features. Finally, due to its simplicity, DiviK is easily generalisable to an even more flexible framework. Therefore, it is helpful for other -omics data (as single cell spatial transcriptomic) or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under Apache 2.0 license at https://github.com/gmrukwa/divik .
Mrukwa Grzegorz, Polanska Joanna
2022-Dec-12
Feature engineering, High-dimensional data analysis, Local feature space adaptation, Machine learning, Mass spectrometry imaging, Omics, Tumour heterogeneity, Unsupervised clustering