DeepMind's AI for Protein Structure Is Coming to the Masses

Machine-learning systems from Google sister company DeepMind and from a rival academic group are now open source and freely accessible, with far-reaching potential applications in medicine, agriculture, materials science and more.

The structure of human interleukin-12 protein bound to its receptor, as predicted by machine-learning software. Credit: Ian Haydon, UW Medicine Institute for Protein Design

It’s protein-structure prediction for the people. Software that accurately determines the 3D shape of proteins is set to become widely available to scientists.

On 15 July, the London-based company DeepMind released an open-source version of its deep-learning neural network AlphaFold 2 and described its approach in a paper in Nature [1]. The network dominated a protein-structure prediction competition last year.

Meanwhile, an academic team has developed its own protein-prediction tool, inspired by AlphaFold 2, that is already gaining popularity with scientists. That system, called RoseTTaFold, performs nearly as well as AlphaFold 2, and is described in a Science paper also published on 15 July [2].

The open-source nature of the tools means that the scientific community should be able to build on the advances to create even more powerful and useful software, says Jinbo Xu, a computational biologist at the University of Chicago in Illinois, who was not involved in either effort.

Structure to function

Proteins are made of strings of amino acids that, when folded into 3D shapes, determine the function of those proteins in cells. For decades, researchers have used experimental techniques such as X-ray crystallography and cryo-electron microscopy to determine protein structures. But such methods can be time-consuming and costly, and some proteins are not amenable to such analysis.

DeepMind sent shock waves through the scientific world last year, when it showed that its software could accurately predict the structure of many proteins from their amino-acid sequence alone (the sequence is determined by DNA). Researchers had been working on this challenge for decades, and AlphaFold 2 performed so well in a biennial protein-prediction exercise called CASP that the competition’s co-founder declared that “in some sense the problem is solved”.

DeepMind — which has a reputation for being cagey about its work — described AlphaFold 2 in a brief presentation at CASP on 1 December. It promised to publish a paper outlining the network in more detail and to make the software available to researchers, but said little else.

“Among academics, there was a fair amount of doom and gloom,” says David Baker, a biochemist at the University of Washington in Seattle whose team developed RoseTTaFold. “If someone has solved the problem you’re working on but doesn’t disclose how they did it, how do you continue working on it?”

“I felt like I lost my job at the time,” says computational chemist Minkyung Baek, a member of Baker’s team. But DeepMind’s presentation also spurred new ideas that Baek couldn’t wait to explore. So she, Baker and their colleagues started brainstorming ways to replicate AlphaFold 2’s success.

They identified several key advances, including how the network uses information about proteins that are evolutionarily related to the targets researchers are trying to predict, and how the predicted structures of one part of a protein can influence how the network handles sequences corresponding to other parts of the molecule.

RoseTTaFold performed nearly as well as AlphaFold 2, and much better than other CASP entries (including some from the Baker lab). It’s not yet clear why it couldn’t equal AlphaFold 2, but one possibility is DeepMind’s expertise, says Baek. “We don’t have any deep-learning engineers in our lab.” Xu is impressed by the efforts of Baek, Baker and their collaborators, and suspects that DeepMind’s success was down to its access to engineering expertise and superior computing power.

Speedy structures

DeepMind has also streamlined AlphaFold 2. Whereas the network took days of computing time to generate structures for some entries to CASP, the open-source version is about 16 times faster, says AlphaFold lead researcher John Jumper. It can generate structures in minutes to hours, depending on the size of the protein. That’s comparable to the speed of RoseTTaFold.

Although the source code for AlphaFold 2 is freely available — including to commercial entities — it might not yet be particularly useful for researchers without technical expertise. DeepMind has collaborated with select researchers and organizations, including the non-profit Drugs for Neglected Diseases initiative headquartered in Geneva, Switzerland, to predict specific targets, but it hopes to broaden access, says Pushmeet Kohli, head of AI for science at DeepMind. “There is a lot more we want to do in this space.”

As well as making the code for RoseTTaFold freely available, Baker’s team has set up a server into which researchers can plug a protein sequence and get a predicted structure. Since it was launched last month, the server has predicted the structure of more than 5,000 proteins submitted by around 500 people, says Baker.

With code now freely available for both RoseTTaFold and AlphaFold 2, researchers will be able to build on both advances, says Xu, and perhaps make the techniques amenable to protein structures that AlphaFold 2 has so far struggled to predict. Two areas of intense interest are predicting the structure of complexes of multiple interacting proteins and applying the software to the design of new proteins.

More by Ewen Callaway

doi: https://doi.org/10.1038/d41586-021-01968-y

References

  1. Jumper, J. et al. Nature https://doi.org/10.1038/s41586-021-03819-2 (2021).

  2. Baek, M. et al. Science https://doi.org/10.1126/science.abj8754 (2021).
