DeepMind open-sources protein structure dataset generated by AlphaFold 2

DeepMind and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), a life sciences lab based in Hinxton, England, today announced the launch of what they claim is the most complete and accurate database of structures for proteins expressed by the human genome. In a joint press conference hosted by the journal Nature, the two organizations said the AlphaFold Protein Structure Database, created using DeepMind’s AlphaFold 2 system, will be made available to the scientific community in the coming weeks.

The recipes for proteins (large molecules consisting of amino acids that are the fundamental building blocks of tissues, muscles, hair, enzymes, antibodies, and other essential parts of living organisms) are encoded in DNA. These genetic definitions dictate the proteins’ three-dimensional structures, which in turn determine their capabilities. But the outcome of protein “folding,” as it’s called, is notoriously difficult to figure out from the corresponding genetic sequence alone. DNA encodes only the sequence of amino acid residues in a chain, not the chain’s final form.

Above: A tuberculosis protein structure predicted by AlphaFold 2.

Image Credit: DeepMind

In December 2018, DeepMind attempted to tackle the challenge of protein folding with AlphaFold, the product of two years of work. Its successor, AlphaFold 2, announced in December 2020, outperformed competing protein structure prediction methods. In the results from the 14th Critical Assessment of Structure Prediction (CASP), AlphaFold 2 had average errors comparable to the width of an atom (about 0.1 nanometers), competitive with the results from experimental methods.
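To give a feel for what an average error of about 0.1 nanometers (1 ångström) means, the sketch below computes a root-mean-square deviation (RMSD) between a predicted and an experimentally determined set of atom positions. This is only a simplified illustration: CASP scores predictions with its own metrics, the coordinates here are invented, and the two structures are assumed to be already superimposed.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equal-length lists of 3D points.

    Assumes the structures are already superimposed; coordinates in ångströms.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(squared / len(predicted))


# Hypothetical toy coordinates (ångströms), not taken from any real protein.
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
experimental = [(1.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 1.0)]

error_angstrom = rmsd(predicted, experimental)
error_nm = error_angstrom / 10  # 1 nanometer = 10 ångströms
print(f"RMSD: {error_angstrom:.2f} Å ({error_nm:.2f} nm)")
```

In this toy case every atom is displaced by exactly 1 ångström, so the RMSD is also 1 ångström, the scale of error the article describes.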

“The AlphaFold database shows the potential for AI to profoundly accelerate scientific progress. Not only has DeepMind’s machine learning system greatly expanded our accumulated knowledge of protein structures and the human proteome overnight, its deep insights into the building blocks of life hold extraordinary promise for the future of scientific discovery,” Alphabet and Google CEO Sundar Pichai said in a press release.

Illuminating protein structures

AlphaFold 2 draws inspiration from the fields of biology, physics, and machine learning, taking advantage of the fact that a folded protein can be thought of as a “spatial graph” where amino acid residues (amino acids contained within a peptide or protein) are nodes, and edges connect the residues in close proximity. AlphaFold 2 leverages an AI algorithm that attempts to interpret the structure of this graph while reasoning over the implicit graph it’s building, using evolutionarily related sequences, multiple sequence alignment, and a representation of amino acid residue pairs.
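The “spatial graph” idea can be sketched in a few lines. The example below is not DeepMind’s code; it is a minimal illustration that treats each residue as a node and connects two residues when their positions fall within a distance cutoff. The function name `build_contact_graph`, the 8 ångström cutoff, and the toy coordinates are all assumptions made for illustration.

```python
import math


def build_contact_graph(coords, cutoff=8.0):
    """Map each residue index to the set of residues within `cutoff` ångströms.

    `coords` holds one representative 3D position per residue. The cutoff is
    an illustrative assumption, not a value taken from AlphaFold 2.
    """
    graph = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # Edges are undirected: record the contact on both nodes.
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical toy chain: three residues close together, one far away.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (30.0, 0.0, 0.0),
]
contact_graph = build_contact_graph(toy_coords)
for residue, neighbors in contact_graph.items():
    print(residue, sorted(neighbors))
```

Here the first three residues end up connected to one another, while the fourth has no edges. A structure prediction system has to infer this kind of graph from sequence information alone, since the coordinates are exactly what is unknown.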

In an open source codebase published last week, DeepMind significantly streamlined AlphaFold 2. Whereas the closed-source system took days of computing time to generate structures, the open source version is about 16 times faster and can produce structures in minutes to hours, depending on the protein size.

These improvements enabled DeepMind and EMBL to create more than 350,000 protein structure predictions, including the human proteome (which spans 20,000 proteins), more than doubling the number of high-accuracy structures available to researchers. Beyond this, DeepMind and EMBL used AlphaFold 2 to predict the structures of 20 other “biologically significant organisms,” yielding over 350,000 structures in total for E. coli, fruit flies, mice, zebrafish, yeast, malaria parasites, tuberculosis bacteria, and more. The plan is to expand coverage to over 100 million structures as improvements to both AlphaFold 2 and the database come online.

Above: AlphaFold 2’s prediction of a malaria parasite protein.

Image Credit: DeepMind

“This will be one of the most important datasets since the mapping of the Human Genome,” EMBL deputy director general Ewan Birney said in a statement. “Making AlphaFold 2 predictions accessible to the international scientific community opens up so many new research avenues, from neglected diseases to new enzymes for biotechnology and everything in between. This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world.”

Some scientists caution that AlphaFold 2 is unlikely to be the be-all and end-all of protein structure prediction. Steven Finkbeiner, professor of neurology at the University of California, San Francisco, told Wired in an interview that it’s too soon to tell what the implications are for drug discovery, given the wide variation in structures within the human body. But DeepMind makes the case that AlphaFold 2, if further refined, could be applied to previously intractable problems, including those related to epidemiological efforts. Last year, the company predicted several protein structures of SARS-CoV-2, including ORF3a, whose makeup was formerly a mystery.

Above: A yeast protein, once again predicted by AlphaFold 2.

Image Credit: DeepMind

DeepMind says it’s committed to making AlphaFold 2 available “at scale” and collaborating with partners to explore new frontiers, like how multiple proteins form complexes and interact with DNA, RNA, and small molecules. Earlier this year, the company announced a partnership with the Geneva-based Drugs for Neglected Diseases Initiative, a nonprofit pharmaceutical organization that hopes to use AlphaFold to identify compounds to treat conditions for which medications remain elusive. The Centre for Enzyme Innovation is using the system to help engineer faster enzymes for recycling polluting single-use plastics. And teams at the University of Colorado Boulder and the University of California, San Francisco are studying antibiotic resistance and SARS-CoV-2 biology with AlphaFold 2.

“Proteins are like tiny exquisite biological machines. The same way that the structure of a machine tells you what it does, so the structure of a protein helps us understand its function,” DeepMind CEO Demis Hassabis wrote in a blog post published today. “At DeepMind, our thesis has always been that artificial intelligence can dramatically accelerate breakthroughs in many fields of science, and in turn advance humanity. We built AlphaFold and the AlphaFold Protein Structure Database to support and elevate the efforts of scientists around the world in the important work they do. We believe AI has the potential to revolutionise how science is done in the 21st century, and we eagerly await the discoveries that AlphaFold might help the scientific community to unlock next.”

