We have known since 1953 that the DNA molecule encodes the genetic information that transmits characteristics from ancestors to descendants in all forms of life on Earth. Genes, which are stretches of the DNA sequence, specify the primary structure of proteins: the sequence of amino acids that make up each protein. Proteins are the cellular machines that do the jobs required to keep a cell alive. The secondary structure of a protein describes how it folds locally, into structures such as alpha helices and beta sheets. Methods that reliably determine secondary structure have existed for some time. However, determining how a protein folds globally in space (its tertiary structure, the shape it assumes) has remained a mostly open problem, beyond the reach of most algorithms in the general case.
The Critical Assessment of protein Structure Prediction (CASP) competition, started in 1994, has taken place every two years since then and has allowed hundreds of competing teams to test their algorithms and approaches on this difficult problem. Thousands of approaches have been tried, with some success, but the accuracy of the predictions remained rather low, especially for proteins that were not similar to other known proteins.
A number of different challenges have taken place in CASP over the years, ranging from ab initio prediction to structure prediction using homology information, and the field has seen steady improvement over time. However, the entrance of DeepMind into the competition raised the stakes and revolutionized the field. As DeepMind itself reports in a blog post, the program AlphaFold 2, a successor of AlphaFold, entered the 2020 edition of CASP and obtained a score of 92.4 on the Global Distance Test (GDT) scale, which ranges from 0 to 100. This value should be compared with the 58.9 obtained by AlphaFold, its predecessor, in 2018, and with the score of 40 obtained by the winner of the 2016 competition.
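To make the GDT scale more concrete, here is a minimal Python sketch of a simplified GDT_TS-style score. It assumes that the predicted and experimental structures are already superimposed and are given as lists of C-alpha coordinates in angstroms. The official CASP evaluation also searches over superpositions, which this sketch leaves out. The function name gdt_ts and the toy coordinates are illustrative only and do not come from CASP or DeepMind.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two already-superimposed structures.

    For each distance cutoff (in angstroms), compute the fraction of residues
    whose predicted position lies within that cutoff of the reference
    position, then average the fractions and scale to 0..100.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example (made-up coordinates): four residues along a line, with the
# prediction displaced by 0.5, 1.5, 3.0 and 10.0 angstroms respectively.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

score = gdt_ts(predicted, reference)
print(score)  # 56.25
```

In this toy case one residue is within 1 angstrom, two are within 2, and three are within both 4 and 8, so the score is the average of 25, 50, 75 and 75. A perfect prediction scores 100.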
Even though the details of the algorithm have not yet been published, the DeepMind post provides enough information to conclude that this result is very significant. Although the whole approach is complex and the system integrates information from a number of sources, it relies on an attention-based neural network, trained end-to-end to learn which amino acids are close to each other and at what distance.
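For readers unfamiliar with attention, the following Python sketch shows generic scaled dot-product self-attention, the basic building block of attention-based networks. It is not DeepMind's architecture, whose details are not described here. The two-dimensional residue feature vectors are invented for illustration, and a real model would learn separate query, key and value projections rather than reuse the raw features as this sketch does.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that are positive and sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of vectors.

    Each position (here: each residue) scores every other position by the
    dot product of its query with their keys, and outputs a weighted
    average of their values.
    """
    dim = len(keys[0])
    all_weights, outputs = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights


# Hypothetical feature vectors for three residues. Residues 0 and 2 have
# identical features, so residue 0 attends to both equally and more
# strongly than to residue 1.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(features, features, features)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights describes how strongly one residue attends to every other residue. During training, such weights are adjusted so that the network's final predictions, such as the distances between residues, match known structures.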
Given the importance of the problem in areas such as biology, medicine and pharmaceutics, this computational approach to protein structure determination can be expected to have a significant impact in the future. Once more, rather general machine learning techniques, developed over the last decades, have shown great potential in real-world problems.