In Journal of Computational Biology: A Journal of Computational Molecular Cell Biology
Application of genetic distances to measure phenotypic relatedness is a challenging task, reflecting the complex relationship between genotype and phenotype. Accurate assessment of proximity among sequences with different phenotypic traits depends on how strongly the chosen distance is associated with structural and functional properties. In this study, we present a new distance measure, Mutual Information and Entropy H (MIH), for categorical data such as nucleotide or amino acid sequences. MIH applies an information matrix (IM), which is calculated from the data and captures the heterogeneity of individual positions, as measured by Shannon entropy, and coordinated substitutions among positions, as measured by mutual information. In general, MIH assigns low weights to differences occurring at high-entropy positions or at dependent positions. The MIH distance was compared with other common distances on two experimental and two simulated data sets. MIH showed the best ability to distinguish cross-immunoreactive sequence pairs from non-cross-immunoreactive pairs of variants of the hepatitis C virus hypervariable region 1 (26,883 pairwise comparisons), and Major Histocompatibility Complex (MHC) binding peptides (n = 181) from non-binding peptides (n = 129). Analysis of 74 simulated RNA secondary structures also showed that the ratio between the MIH distance of sequences from the same RNA structure and the MIH distance of sequences from different structures is three orders of magnitude greater than the corresponding ratio for Hamming distances. These findings indicate that a lower MIH distance between two sequences is associated with a greater probability that the sequences belong to the same phenotype. Examination of rule-based phenotypes generated in silico showed that (1) MIH is strongly associated with phenotypic differences, (2) the IM of sequences under selection is very different from the IM generated under random scenarios, and (3) the IM is robust to sampling.
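To make the weighting idea above concrete, the following is a minimal sketch of an entropy- and MI-weighted distance over an alignment of categorical sequences. The exact information-matrix formula of the published MIH differs; the function names and the specific way entropy and mutual information are combined into per-position weights here are illustrative assumptions only.

```python
import math
from itertools import combinations

def column_entropy(col):
    """Shannon entropy (bits) of one alignment column."""
    n = len(col)
    counts = {}
    for c in col:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_i)
    joint, pi, pj = {}, {}, {}
    for a, b in zip(col_i, col_j):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pi[a] = pi.get(a, 0) + 1
        pj[b] = pj.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((pi[a] / n) * (pj[b] / n)))
    return mi

def position_weights(alignment):
    """Illustrative weighting: down-weight high-entropy positions and
    positions strongly coupled (high total MI) to other positions."""
    cols = list(zip(*alignment))
    ent = [column_entropy(c) for c in cols]
    coupling = [0.0] * len(cols)
    for i, j in combinations(range(len(cols)), 2):
        mi = mutual_information(cols[i], cols[j])
        coupling[i] += mi
        coupling[j] += mi
    max_e = max(ent) or 1.0
    max_c = max(coupling) or 1.0
    return [(1 - e / max_e) * (1 - c / max_c) for e, c in zip(ent, coupling)]

def mih_like_distance(s1, s2, weights):
    """Weighted Hamming distance: a mismatch at position p contributes weights[p]."""
    return sum(w for a, b, w in zip(s1, s2, weights) if a != b)
```

With this weighting, a mismatch at a fully conserved position costs its full weight, while a mismatch at a maximally variable or strongly coupled position costs little, mirroring the qualitative behavior described for MIH.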
In conclusion, MIH strongly approximates structural/functional distances and should have important applications to a wide range of biological problems, including evolution, artificial selection of biological functions and structures, and measuring phenotypic similarity.
David S. Campo, Alexander Mosa, Yury Khudyakov
2023-Jan-03
Shannon entropy, categorical variables, genetic distance, machine learning, mutual information, natural and artificial selection, phenotype, protein