ArXiv Preprint
Clustering is commonly performed as an initial analysis step for uncovering
structure in 'omics datasets, e.g. to discover molecular subtypes of disease.
The high-throughput, high-dimensional nature of these datasets means that they
provide information on a diverse array of biomolecular processes and
pathways. Different groups of variables (e.g. genes or proteins) will be
implicated in different biomolecular processes, so analyses that identify
only a single clustering partition of the whole dataset are liable to
conflate the multiple clustering structures that may arise from these
distinct processes. To address this, we propose a
multi-view Bayesian mixture model that identifies groups of variables
("views"), each of which defines a distinct clustering structure. We consider
applications in stratified medicine, for which our principal goal is to
identify clusters of patients that define distinct, clinically actionable
disease subtypes. We adopt the semi-supervised, outcome-guided mixture
modelling approach of Bayesian profile regression, which uses a response
variable to guide inference toward the clusterings that are most
relevant in a stratified medicine context. We present the model, together with
illustrative simulation examples, and examples from pan-cancer proteomics. We
demonstrate how the approach can be used to perform integrative clustering, and
consider an example in which different 'omics datasets are integrated in the
context of breast cancer subtyping.
Paul D. W. Kirk, Filippo Pagani, Sylvia Richardson
2023-03-01