ArXiv Preprint
This paper demonstrates that language models are strong structure-based
protein designers. We present LM-Design, a generic approach to reprogramming
sequence-based protein language models (pLMs), which have learned massive
sequential evolutionary knowledge from the universe of natural protein
sequences, to acquire an immediate capability to design preferable protein
sequences for given folds. We conduct a structural surgery on pLMs, where a
lightweight structural adapter is implanted into pLMs and endows them with
structural awareness. During inference, iterative refinement is performed to
effectively optimize the generated protein sequences. Experiments show that our
approach outperforms state-of-the-art methods by a large margin, leading to
4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on
CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein
complexes). We provide extensive and in-depth analyses, which verify that
LM-Design can (1) indeed leverage both structural and sequential knowledge to
accurately handle structurally non-deterministic regions, (2) benefit from
scaling data and model size, and (3) generalize to other proteins (e.g.,
antibodies and de novo proteins).
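
To make the described recipe concrete, the sketch below illustrates the general
idea in PyTorch under stated assumptions: a frozen pLM trunk whose hidden states
pass through a small cross-attention adapter conditioned on structure features,
followed by confidence-based iterative refinement at inference. This is a minimal
illustration, not the authors' implementation; all names (StructuralAdapter,
LMDesignSketch, iterative_refine) and the pLM interface are hypothetical.

```python
# Conceptual sketch (not the released LM-Design code): a frozen sequence-based
# protein language model (pLM) gains structural awareness via a lightweight
# adapter, and sequences are produced by iterative refinement at inference.

import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Lightweight adapter: sequence states cross-attend to structure features."""
    def __init__(self, d_model: int, d_struct: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_struct, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, seq_states, struct_feats):
        s = self.proj(struct_feats)                # [B, L, d_model]
        h, _ = self.cross_attn(seq_states, s, s)   # attend to structure features
        seq_states = self.norm1(seq_states + h)
        return self.norm2(seq_states + self.ffn(seq_states))

class LMDesignSketch(nn.Module):
    """Frozen pLM trunk + trainable adapter + output head (illustrative only).
    Assumes `plm_trunk(tokens)` returns hidden states of shape [B, L, d_model]."""
    def __init__(self, plm_trunk: nn.Module, d_model: int, d_struct: int, vocab: int):
        super().__init__()
        self.plm = plm_trunk
        for p in self.plm.parameters():            # keep the pretrained pLM frozen
            p.requires_grad = False
        self.adapter = StructuralAdapter(d_model, d_struct)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, struct_feats):
        states = self.plm(tokens)                  # [B, L, d_model]
        states = self.adapter(states, struct_feats)
        return self.head(states)                   # per-position residue logits

@torch.no_grad()
def iterative_refine(model, tokens, struct_feats, mask_id: int, n_iters: int = 5):
    """Predict all residues, then re-mask the least confident ones and re-predict."""
    for t in range(n_iters):
        logits = model(tokens, struct_feats)
        conf, pred = logits.softmax(-1).max(-1)    # per-position confidence, [B, L]
        tokens = pred
        if t < n_iters - 1:
            # Re-mask a shrinking fraction of low-confidence positions each round.
            k = int(0.5 * (1 - t / n_iters) * tokens.shape[1])
            if k > 0:
                idx = conf.topk(k, largest=False).indices
                tokens = tokens.scatter(1, idx, mask_id)
    return tokens
```

As a usage note, one would start from a fully masked sequence of the target
length, encode the given backbone into per-residue structure features with any
structure encoder, and call iterative_refine to obtain a designed sequence; the
specific encoder, masking schedule, and number of iterations are design choices
not specified by this abstract.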
Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei YE, Quanquan Gu
2023-02-03