Microsoft Research Blog

The Dayhoff Atlas: scaling sequence diversity improves protein design


Kevin K. Yang, Sarah Alamdari, Alex J. Lee, Kaeli Kaymak-Loveless, Samir Char, Garyk Brixi, Carles Domingo-Enrich, Chentong Wang, Suyue Lyu, Nicolo Fusi, Philip Rosenfield, Neil Tenenholtz, Ava P. Amini

“The body of data available in protein sequences is something fundamentally new in biology and biochemistry: unprecedented in quantity, in concentrated information content, and in conceptual simplicity.” – Margaret Oakley Dayhoff

Public databases of protein sequence and structure have enabled scientists to better understand protein biology and to build powerful computational models to further that understanding. It all began in 1965, when Margaret Dayhoff published the Atlas of Protein Sequence and Structure, which collated the 65 proteins whose amino acid sequences were then known. Today, we are excited to pay tribute to that legacy with the release of the Dayhoff Atlas (Fig. 1), a centralized collection of both protein sequence data and generative models designed to serve as a modern-day resource for protein biology in the age of AI.

The Dayhoff Atlas dramatically expands the scale and diversity of publicly available protein data by providing the largest open dataset of natural proteins to date, GigaRef, and a first-in-class, large-scale dataset of synthetic proteins, BackboneRef. Using these data and sets of evolutionarily-related sequences, we trained the Dayhoff family of protein language models, including the first model that combines single proteins and sets of evolutionarily-related sequences at scale. Today, we make the Dayhoff Atlas – including all datasets, models, Dayhoff-generated sequences (DayhoffRef), and code – fully open to empower scientists to advance our ability to understand and design proteins (Fig. 1). The Dayhoff models will also be featured in Azure AI Foundry, where they can be deployed by developers and enterprises to accelerate their own innovation.

Figure 1: The Dayhoff Atlas of datasets and models for protein sequence generation.

Expanding the scale and diversity of natural protein data

The Dayhoff models are protein language models (PLMs) that learn the language of amino acid sequences in order to generate new proteins. Before the Dayhoff Atlas, most PLMs were trained on protein sequences derived from the genomes of individual organisms. However, this restricts training to organisms whose whole genomes can be sequenced, excluding vast swaths of biology across other evolutionary lineages. By recovering genetic material directly from field samples, metagenomics enables the characterization of genomes and proteomes from contexts such as the gut microbiome, oceanic surveys, and soil samples. We turned to this technique to expand the scale and diversity of natural protein sequences available for PLM training.

We integrated metagenome- and genome-derived sequences to produce GigaRef, which contains over 3.34 billion protein sequences and is the largest open dataset of proteins to date (Fig. 2A). GigaRef combines UniRef, the largest database of genome-derived sequences, with eight metagenomic databases, then deduplicates and clusters the pooled sequences into a single, unified dataset of natural proteins. GigaRef provides a ~16x increase in the total number of sequences relative to UniRef90 and a ~24x increase in the number of clusters relative to UniRef50 (UniRef March 2025 release).
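To make the combine–deduplicate–cluster step concrete, here is a minimal Python sketch. The file names, the 50% identity threshold (mirroring UniRef50), and the use of difflib as a stand-in for a real alignment-based identity measure are illustrative assumptions; the actual GigaRef pipeline operates at the scale of billions of sequences with tooling built for that purpose.

```python
# Minimal sketch: pool sequence sources, drop exact duplicates, then cluster
# at a sequence-identity threshold. Input paths and the identity measure are
# assumptions for illustration only.
from difflib import SequenceMatcher

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def greedy_cluster(sequences, min_identity=0.5):
    """Greedy clustering: each sequence joins the first representative it
    matches above the identity threshold, otherwise it founds a new cluster."""
    representatives, clusters = [], {}
    for header, seq in sequences:
        for rep_header, rep_seq in representatives:
            # difflib's ratio() is a crude stand-in for alignment-based identity.
            if SequenceMatcher(None, seq, rep_seq).ratio() >= min_identity:
                clusters[rep_header].append(header)
                break
        else:
            representatives.append((header, seq))
            clusters[header] = [header]
    return clusters

# Combine genome- and metagenome-derived sequences (hypothetical file names),
# remove exact duplicates, then cluster the remainder.
seen, pooled = set(), []
for path in ["uniref.fasta", "metagenomic.fasta"]:
    for header, seq in read_fasta(path):
        if seq not in seen:
            seen.add(seq)
            pooled.append((header, seq))
clusters = greedy_cluster(pooled, min_identity=0.5)
```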

Distilling structural and evolutionary information into protein sequence space

While GigaRef offers great diversity in protein sequences, protein structure provides semantically rich, complementary information. Ideally, protein language models could capitalize on both to improve the quality and fidelity of the proteins they design.

We distilled structural information into sequence space by creating BackboneRef: a large-scale dataset of 240,811 synthetic structural backbones and corresponding structure-based synthetic amino acid sequences (Fig. 2B). BackboneRef’s structures contain 83,121 new folds that are not present in natural proteins. By predicting amino-acid sequences that would fold into these structures, we generated synthetic protein sequences that provide novel, structure-based data for training the Dayhoff models.
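As an illustration of this structure-to-sequence distillation, the sketch below pairs each synthetic backbone with sequences proposed by an inverse-folding model. The `build_backboneref` helper, the record layout, and the toy designer are assumptions for illustration, not the actual BackboneRef pipeline; a real pipeline would plug in an inverse-folding model such as a ProteinMPNN-style designer.

```python
def build_backboneref(backbones, inverse_fold, sequences_per_backbone=1):
    """Pair each synthetic backbone with its best-scoring designed sequences.

    `backbones` is an iterable of (backbone_id, coords); `inverse_fold` is any
    callable returning [(sequence, score), ...] candidates for one backbone.
    """
    records = []
    for backbone_id, coords in backbones:
        # Rank candidate sequences by score and keep the top designs.
        candidates = sorted(inverse_fold(coords), key=lambda pair: pair[1], reverse=True)
        for seq, score in candidates[:sequences_per_backbone]:
            records.append({"backbone": backbone_id, "sequence": seq, "score": score})
    return records

# Toy stand-in designer so the sketch runs end to end; swap in a real
# inverse-folding model in practice.
def toy_designer(coords):
    return [("M" + "A" * (len(coords) - 1), 0.0)]

records = build_backboneref([("bb_0001", [None] * 120)], toy_designer)
```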

In addition, we processed over 16 million pre-computed multiple sequence alignments (MSAs) to obtain sets of evolutionarily-related proteins – “homologs” – which supply both structural and evolutionary signal for training the Dayhoff models.
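A minimal sketch of turning a pre-computed alignment into a set of homolog sequences is shown below. The input format (aligned FASTA/A3M with “-” gaps and lowercase insertion states) is an assumption for illustration, not a description of the exact preprocessing used for the Dayhoff models.

```python
def alignment_to_homologs(path):
    """Return ungapped, uppercase sequences from an aligned FASTA/A3M file."""
    homologs, current = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    homologs.append("".join(current))
                current = []
            else:
                # Drop alignment gaps and normalize A3M lowercase insertions.
                current.append(line.replace("-", "").replace(".", "").upper())
        if current:
            homologs.append("".join(current))
    # Deduplicate while preserving order.
    return list(dict.fromkeys(homologs))
```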

Figure 2: (A) Overview of GigaRef. (B) Overview of BackboneRef.

Unifying single-sequence and evolutionary modeling

Leveraging the datasets of the Dayhoff Atlas, we trained a single generative PLM that unifies learning over single sequences and sets of evolutionarily-related homologs. We achieved this by “unrolling” each homolog set into a series of sequences separated by special tokens. This unrolling requires long context lengths, so we used a hybrid architecture that combines transformer attention, which offers explicit “lookup” abilities, with state-space models, which provide efficient long-context processing. Our best-performing 3-billion-parameter model, Dayhoff-3b-GR-HM-c, is trained on natural sequences from GigaRef and on sets of homologs. It can accurately predict mutation effects, design high-quality individual proteins, and perform guided generation of new sequences by conditioning on evolutionarily-related proteins.
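The sketch below illustrates what such unrolling could look like, and how the same format supports conditioning generation on known homologs. The separator and end-of-example token names and the prompt construction are illustrative assumptions, not the Dayhoff models’ actual vocabulary or inference API.

```python
import random

SEQ_SEP = "<sep>"      # hypothetical token placed between related sequences
EXAMPLE_END = "<eos>"  # hypothetical end-of-example token

def unroll_homologs(homologs, shuffle=True):
    """Concatenate a set of evolutionarily-related sequences into one long
    training example, with sequences separated by special tokens."""
    order = list(homologs)
    if shuffle:
        random.shuffle(order)  # vary which sequences provide context for which
    return SEQ_SEP.join(order) + EXAMPLE_END

def conditioning_prompt(known_homologs):
    """Prompt for guided generation: known homologs as context, ending with a
    separator so the model continues with a new, related sequence."""
    return SEQ_SEP.join(known_homologs) + SEQ_SEP

example = unroll_homologs(["MKTAYIAKQR", "MKSAYIAKQR", "MKTAFIAKQK"])
prompt = conditioning_prompt(["MKTAYIAKQR", "MKSAYIAKQR"])
```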

Increasing protein expression rates in the wet lab

How do dataset choice and model scale affect the quality of proteins generated by the Dayhoff model? To answer this question, we trained a suite of Dayhoff models on different combinations of the Dayhoff Atlas datasets and evaluated them across a range of tasks to obtain a multi-faceted view into model performance.

One of the most important measures of a protein language model’s quality is whether the sequences it generates can be produced by cells in the lab, as this reflects the biological plausibility and stability of the synthetic proteins. In the first study of its kind, we generated sequences from different Dayhoff models and tested them head-to-head in the lab, measuring whether they could be produced – “expressed” – by E. coli bacteria (Fig. 3).

Training on GigaRef yielded a small downstream increase in the fraction of expressed proteins, with 34.5% of proteins from Dayhoff-170m-GR expressing successfully versus 27.6% from Dayhoff-170m-UR90. Increasing model and dataset scale further improved the expression rate to 35.7% for Dayhoff-3b-GR-HM-c. Most notably, augmenting training with structure-based synthetic data from BackboneRef produced the highest expression success rate, with 51.7% of proteins from Dayhoff-170m-UR50-BRn expressing – a 1.875-fold increase over Dayhoff-170m-UR90.

Figure 3: Model scale, metagenomic data, and structure-based augmentation increase the cellular expression rates of Dayhoff-generated proteins.

These results show that macroscopic design choices – like training dataset composition and model size – yield measurable differences in real-world protein expression, which is not only a readout of quality but also often the first step to functional characterization of a synthetic protein. In the future, principled choices about training data composition could help shift proteins generated by PLMs to exhibit desired macroscale properties like expressibility.

What’s next

The Dayhoff Atlas is a first-in-class, centralized, fully open resource for protein sequence data and language models. We are excited to see the community build on this Atlas to advance protein science and engineering, carrying on Margaret Dayhoff’s legacy.

Access The Dayhoff Atlas

Links to all resources associated with the Dayhoff Atlas:

Acknowledgements

The Dayhoff Atlas is the result of a highly collaborative effort across Microsoft Research, including contributions from many of our past research interns. The full list of authors: Kevin K. Yang, Sarah Alamdari, Alex J. Lee, Kaeli Kaymak-Loveless, Samir Char, Garyk Brixi, Carles Domingo-Enrich, Chentong Wang, Suyue Lyu, Nicolo Fusi, Neil Tenenholtz, Ava P. Amini. We thank Philip Rosenfield, Hannah Richardson, and Sean Whitzell for their help and support with this work.
