Easy to state

While highly successful, processes that use physical tools to identify protein structures are expensive and time-consuming, even with modern techniques such as cryo-electron microscopy. As a result, the vast majority of protein structures, and the effects of disease-causing mutations on these structures, remain unknown.

Computational methods that calculate how proteins fold have the potential to dramatically reduce the cost and time needed to determine structure. But the problem is difficult and remains unsolved after nearly four decades of intense effort.

A visual simulation of how the model learns. It trains itself by repeatedly predicting a structure (colored) and comparing its prediction to the ground truth structure (grey). This is repeated for thousands of known proteins, with the model learning and improving its accuracy with every iteration. Animation: Mohammed AlQuraishi.

Proteins are built from a library of 20 different amino acids. These act like letters in an alphabet, combining into words, sentences and paragraphs to produce an astronomical number of possible texts. Unlike alphabet letters, however, amino acids are physical objects positioned in 3D space. Often, sections of a protein will be in close physical proximity but be separated by large distances in terms of sequence, as its amino acid chains form loops, spirals, sheets and twists.

“What’s compelling about the problem is that it’s fairly easy to state: take a sequence and figure out the shape,” AlQuraishi said. “A protein starts off as an unstructured string that has to take on a 3D shape, and the possible sets of shapes that a string can fold into is huge. Many proteins are thousands of amino acids long, and the complexity quickly exceeds the capacity of human intuition or even the most powerful computers.”

Hard to solve

To address this challenge, scientists leverage the fact that amino acids interact with each other based on the laws of physics, seeking out energetically favorable states like a ball rolling downhill to settle at the bottom of a valley.
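
The downhill picture can be made concrete with a toy example. The sketch below is not a protein force field: it assumes a hypothetical one-dimensional energy function with a single valley, and it simply follows the slope until it settles at the bottom.

```python
def energy(x):
    # Hypothetical toy energy landscape with a single valley at x = 2.
    return (x - 2.0) ** 2 + 1.0


def energy_gradient(x):
    # Analytical derivative of the toy energy above.
    return 2.0 * (x - 2.0)


def roll_downhill(x, step=0.1, iterations=200):
    # Repeatedly move against the slope, like a ball rolling downhill.
    for _ in range(iterations):
        x -= step * energy_gradient(x)
    return x


final_x = roll_downhill(x=-5.0)
print(round(final_x, 4), round(energy(final_x), 4))
```

A real protein's energy landscape has an enormous number of dimensions and many valleys, which is why simulating it directly demands so much computing power.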

The most advanced algorithms calculate protein structure by running on supercomputers, or on crowd-sourced computing power in the case of projects such as Rosetta@Home and Folding@Home, to simulate the complex physics of amino acid interactions through brute force. To reduce the massive computational requirements, these projects rely on mapping new sequences onto predefined templates, which are protein structures previously determined through experiment.

Other projects such as Google’s AlphaFold have generated tremendous recent excitement by using advances in artificial intelligence to predict a protein’s structure. To do so, these approaches parse enormous volumes of genomic data, which contain the blueprint for protein sequences. They look for sequences across many species that have likely evolved together, using such sequences as indicators of close physical proximity to guide structure assembly.
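
To give a feel for what a co-evolution signal looks like, the sketch below scores how strongly two positions in a set of aligned sequences vary together, using mutual information. This is an illustrative assumption, not a description of how AlphaFold or any specific tool works; production methods use more sophisticated statistical models. The four-sequence alignment is a made-up toy.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    # Mutual information (in bits) between alignment columns i and j.
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Ratio of the joint frequency to the product of the two
        # single-column frequencies.
        mi += p_ab * math.log2(p_ab * n * n / (col_i[a] * col_j[b]))
    return mi


# Hypothetical toy alignment: one row per species, one column per position.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]

coupled = mutual_information(alignment, 0, 1)      # positions change together
independent = mutual_information(alignment, 0, 3)  # positions change freely
print(coupled, independent)
```

In this toy data, positions 0 and 1 always change together and score high, while positions 0 and 3 vary independently and score zero. A high score is the kind of hint that two residues may sit close together in the folded structure.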

These AI approaches, however, do not predict structures based solely on a protein’s amino acid sequence. Thus, they are limited in their ability to determine structures of proteins for which there is no prior knowledge, evolutionarily unique proteins, or novel proteins designed by humans.

Training deeply

To develop a new approach, AlQuraishi applied so-called end-to-end differentiable deep learning. This branch of artificial intelligence has dramatically reduced the computational power and time needed to solve problems such as image and speech recognition, enabling applications such as Apple’s Siri and Google Translate.

In essence, differentiable learning involves a single, enormous mathematical function—a much more sophisticated version of a high school calculus equation—arranged as a neural network, with each component of the network feeding information forward and backward.

This function can tune and adjust itself, over and over at unimaginable levels of complexity, in order to “learn” precisely how a protein sequence mathematically relates to its structure.
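
The tuning process can be illustrated at a vastly smaller scale. The sketch below assumes a function with just two adjustable numbers, w and b, and made-up training data; a real network has millions of parameters, but it relies on the same cycle of a forward pass that makes a prediction and a backward pass that works out how to adjust each parameter.

```python
def train(samples, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0  # adjustable parameters of the function
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, target in samples:
            prediction = w * x + b  # forward pass
            error = prediction - target
            # Backward pass: chain rule through loss = error ** 2,
            # averaged over all samples.
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # Nudge each parameter in the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from target = 3 * x + 1.
samples = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]
w, b = train(samples)
print(round(w, 4), round(b, 4))
```

Because every step from input to error is differentiable, the error can be traced back to each parameter, which is what "end-to-end differentiable" refers to.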

AlQuraishi developed a deep-learning model, termed a recurrent geometric network, which focuses on key characteristics of protein folding. But before it can make new predictions, it must be trained using previously determined sequences and structures.

A visual simulation of how the model calculates the angles of the bonds between amino acids, and the angles of rotation around those bonds, to assemble the geometry of a protein structure. Animation: Mohammed AlQuraishi.

For each amino acid, the model predicts the most likely angle of the chemical bonds that connect the amino acid with its neighbors. It also predicts the angle of rotation around these bonds, which affects how any local section of a protein is geometrically related to the entire structure.
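
How a list of angles becomes a three-dimensional shape can be sketched in a few lines. The code below is an illustrative assumption rather than the model's actual implementation: it treats each amino acid as a single point, uses a fixed bond length of 1.0, and places each new point from the three points before it. The function names and angle values are hypothetical.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def normalize(u):
    length = math.sqrt(sum(a * a for a in u))
    return tuple(a / length for a in u)


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    # Place a new point d bonded to c, given the preceding points a, b, c.
    bc = normalize(sub(c, b))
    n = normalize(cross(sub(b, a), bc))  # normal of the a-b-c plane
    m = cross(n, bc)
    # Coordinates of d in the local frame (bc, m, n).
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[k] + x * bc[k] + y * m[k] + z * n[k] for k in range(3))


def build_chain(angles, bond_length=1.0):
    # Three arbitrary, non-collinear seed points fix the position in space.
    chain = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    for bond_angle, torsion in angles:
        chain.append(place_atom(*chain[-3:], bond_length, bond_angle, torsion))
    return chain


# Hypothetical (bond angle, rotation angle) pairs, in radians.
angles = [(math.radians(110.0), math.radians(-60.0)),
          (math.radians(110.0), math.radians(180.0)),
          (math.radians(115.0), math.radians(60.0))]
chain = build_chain(angles)
for point in chain:
    print(tuple(round(p, 3) for p in point))
```

Changing a single rotation angle early in the list swings every later point to a new position, which is why a local angle affects how a section relates to the entire structure.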

This is done repeatedly, with each calculation informed and refined by the relative positions of every other amino acid. Once the entire structure is completed, the model checks the accuracy of its prediction by comparing it against the “ground truth” structure of the protein.
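
The article does not specify how the comparison is scored. One measure that suits this setting, assumed here for illustration, is a distance-based root-mean-square deviation: it compares the distances between every pair of points in the prediction with the same distances in the ground truth, so the two structures never need to be lined up first. The coordinates below are made up.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    # Distance between every pair of points in the structure.
    return [math.dist(p, q) for p, q in combinations(coords, 2)]


def drmsd(predicted, truth):
    # Root-mean-square difference between the two sets of pairwise distances.
    d_pred = pairwise_distances(predicted)
    d_true = pairwise_distances(truth)
    squared = [(p - t) ** 2 for p, t in zip(d_pred, d_true)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical four-point "structures".
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in truth]  # same shape, moved
stretched = [(2.0 * x, y, z) for x, y, z in truth]            # different shape

print(drmsd(shifted, truth), drmsd(stretched, truth))
```

A structure that has only been moved scores zero, while a distorted one scores above zero. One known limitation of this kind of measure is that it cannot tell a structure from its mirror image, since both have identical internal distances.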

This entire process is repeated for thousands of known proteins, with the model learning and improving its accuracy with every iteration.

New vista

Once his model was trained, AlQuraishi tested its predictive power. He compared its performance against other methods from several recent rounds of the Critical Assessment of Protein Structure Prediction (CASP), a biennial experiment that tests computational methods for their ability to make predictions using protein structures that have been determined but not publicly released.

He found that the new model outperformed all other methods at predicting protein structures for which there are no preexisting templates, including methods that use co-evolutionary data. It also outperformed all but the best methods when preexisting templates were available to make predictions.