In bioRxiv : the preprint server for biology
MOTIVATION : Understanding the landscape of natural selection in humans and other species has been a major focus for the use of machine learning methods in population genetics. Existing methods rely on computationally intensive simulated training data incorporating selection. Unlike efficient neutral coalescent simulations for demographic inference, realistic selection typically requires slow forward simulations. Large population sizes (for example due to recent exponential growth in humans) make these simulations even more prohibitive. Because there are many possible modes of selection, a high-dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Since machine learning methods use the simulated data for training, mismatches between simulated training data and real test data are particularly problematic. In addition, it has been difficult to interpret the trained neural networks, leading to a lack of understanding about what features contribute to identifying selected variants.
RESULTS : Here we develop a new approach to detect selection that does not require selection simulations during training. We use a Generative Adversarial Network (GAN) that has been trained to simulate neutral data that mirrors a real genomic dataset. The resulting GAN consists of a generator (demographic model) and a discriminator (convolutional neural network). For a given genomic region, the discriminator predicts whether it is "real" genomic data or "fake" in the sense that it could have been simulated by the generator. As the "real" training data includes regions that experienced selection and the generator cannot produce such regions, regions with a high probability of being real may have experienced selection. This enables us to apply the trained discriminator of the GAN to held-out test data and identify candidate selected regions. We show that this approach has high power to identify regions under selection in simulations, and that it reliably recovers selected regions identified by state-of-the-art population genetic methods in three human populations (YRI, CEU, and CHB). Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics. In summary, our approach is a novel, efficient, and powerful way to use machine learning to detect natural selection.
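The step of turning discriminator output into candidate selected regions can be illustrated with a minimal sketch. It assumes, as one plausible calibration that the abstract does not spell out, that the threshold is set from the discriminator's scores on neutral simulated regions so that a chosen fraction of them would be flagged by chance. The function names, the false positive rate, and all probability values below are hypothetical and are not taken from the disc-pg-gan software.

```python
import math


def neutral_threshold(neutral_probs, false_positive_rate=0.05):
    """Return the score that roughly a fraction `false_positive_rate`
    of neutral regions exceed (an empirical upper quantile)."""
    if not neutral_probs:
        raise ValueError("need at least one neutral score")
    ranked = sorted(neutral_probs)
    # Position of the (1 - FPR) empirical quantile, clamped to a valid index.
    idx = math.ceil((1.0 - false_positive_rate) * len(ranked)) - 1
    idx = max(0, min(idx, len(ranked) - 1))
    return ranked[idx]


def candidate_regions(real_probs, threshold):
    """Indices of real regions scored strictly above the threshold."""
    return [i for i, p in enumerate(real_probs) if p > threshold]


# Hypothetical discriminator outputs, P(region is "real"), invented for
# illustration only; they are not results from the paper.
neutral_probs = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
real_probs = [0.15, 0.92, 0.40, 0.58, 0.97]

# Assumed calibration: allow about 10% of neutral regions to be flagged.
threshold = neutral_threshold(neutral_probs, false_positive_rate=0.10)
candidates = candidate_regions(real_probs, threshold)
print(threshold, candidates)
```

Calibrating against neutral simulations keeps the procedure free of selection simulations, in line with the approach described above; a stricter false positive rate would trade power for fewer spurious candidates.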
AVAILABILITY : Our software is available open-source at https://github.com/mathiesonlab/disc-pg-gan.
Riley Rebecca, Mathieson Iain, Mathieson Sara
2023-Mar-08