Predicting the SARS-CoV-2 epidemic and "immune escape" mutations remains a crucial problem. We present a theoretical framework, the Phenotype-Embedding (P-E) theorem, and prove that viral fitness can be calculated by selecting an appropriate sequence embedding under the variational autoencoder (VAE) framework. Starting from the P-E theorem and using a modified Transformer model, we obtain a computable quantitative relationship between "immune escape" mutations and the fitness of a viral lineage, and plot a genotype-fitness landscape in the embedding space. Using only sequence data for the SARS-CoV-2 spike protein, we accurately calculate viral fitness and the basic reproduction number (R0). In addition, our model can simulate neutral viral evolution and spatio-temporal selection, decipher the effects of epistasis and recombination, and more accurately predict viral mutations associated with immune escape. Our work provides a theoretical framework for constructing genotype-phenotype landscapes and a paradigm for the interpretability of deep learning in virus evolution research.
Liu, Y.; Luo, Y.; Lu, X.; Gao, H.; He, R.; Zhang, X.; Zhang, X.; Li, Y.