In Protein Science: a publication of the Protein Society
We describe the Predicting Protein Compound Interactions (PrePCI) database, which comprises over 5 billion predicted interactions between nearly 7 million chemical compounds and 19,797 human proteins. PrePCI relies on a proteome-wide database of structural models based on both traditional modeling techniques and the AlphaFold Protein Structure Database. Sequence- and structure-based similarity metrics are established between template proteins, T, in the Protein Data Bank that bind compounds, C, and query proteins, Q, in the model database. When these metrics exceed threshold values, it is assumed that C also binds to Q with a likelihood ratio (LR) derived from machine learning. If the relationship is based on structure, this LR is based on a scoring function that measures the extent to which C is compatible with the binding site of Q, as described in the LT-scanner algorithm. For every predicted complex derived in this way, chemical similarity based on the Tanimoto coefficient identifies other small molecules that may bind to Q. An LR for the binding of C to Q is obtained from naïve Bayesian statistics. The PrePCI database can be queried by entering a UniProt ID or gene name for a protein to obtain a list of compounds predicted to bind to it, along with the associated LRs. Alternatively, entering an identifier for a compound returns a list of proteins it is predicted to bind. Specific applications of the database to lead discovery, elucidation of drug mechanisms of action, and biological function annotation are described.
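As a rough illustration of two ingredients named in the abstract, the sketch below computes a Tanimoto coefficient between two compound fingerprints and combines likelihood ratios under the naïve Bayes assumption of conditionally independent evidence. It is not the PrePCI implementation: the function names, the representation of fingerprints as sets of on-bit indices, and all numeric values (the LRs and the prior odds) are hypothetical, and the abstract does not specify which evidence sources are multiplied together.

```python
from math import prod


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def combine_likelihood_ratios(lrs) -> float:
    """Naive Bayes combination: independent pieces of evidence multiply."""
    return prod(lrs)


def posterior_probability(prior_odds: float, combined_lr: float) -> float:
    """Convert prior odds and a combined LR into a posterior probability."""
    posterior_odds = prior_odds * combined_lr
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical fingerprints: 2 shared bits out of 6 distinct bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})

# Hypothetical LRs, e.g. one from protein similarity and one from chemical similarity.
combined = combine_likelihood_ratios([4.0, 2.5])

# Hypothetical prior odds of 1:100 that a given compound binds a given protein.
probability = posterior_probability(0.01, combined)
```

Multiplying LRs is only valid to the extent that the evidence sources are independent given the binding outcome; correlated evidence would overstate the combined LR.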
Trudeau Stephen J, Hwang Howook, Mathur Deepika, Begum Kamrun, Petrey Donald, Murray Diana, Honig Barry
2023-Feb-13
Protein compound interactions, chemical similarity, protein-compound database, structural alignment