New foundation model “Evo” unlocks sequence modeling and design at the genomic scale



Summary

A team from Together AI and the Arc Institute presents Evo, an AI model for biological research that can interpret DNA, RNA, and proteins and enable generative design at the molecular and genomic level.

Developed by a team of experts consisting of Eric Nguyen, Michael Poli, Matthew Durrant, Patrick Hsu and Brian Hie, the model represents a milestone in the processing and analysis of biological data. Using a modified version of the StripedHyena architecture, Evo is unique in its ability to interpret the fundamental biological “languages” – DNA, RNA, and proteins – to make predictions and enable generative design from the molecular to the genomic level.

The new architecture enables Evo to model long contexts: it can process sequences of up to 131 kilobases (131,000 bases) and generate sequences of more than 650,000 tokens. Long context is particularly important for biological AI models because DNA sequences can be extremely long (up to billions of nucleotides), and high sensitivity is required to understand the effects of evolution, which can hinge on single-nucleotide changes. Evo therefore works at the nucleotide level, recognizing and interpreting the smallest building blocks of DNA and RNA.

“Evo tries to show a path forward toward unified and foundation modeling on biology,” says Michael Poli, co-author of Evo and StripedHyena. Like language models, Evo is trained with a next-token prediction objective: given the preceding sequence, the model predicts the next token, which in this case is a single nucleotide. “The problem up until now, why this hasn’t been done, is that sequences are extremely long if you want to capture meaningful properties about DNA and also learning at high resolution is quite challenging for transformers,” says Poli. He is alluding to tokenizers, which convert text into tokens in language models. Tokenizers are often responsible for performance issues in LLMs because they usually do not work at the character level but instead merge parts of words or several digits into a single token.
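To make the idea of nucleotide-level next-token prediction concrete, here is a minimal Python sketch. It is not Evo's actual tokenizer or training code: the four-letter vocabulary and the function names `tokenize` and `next_token_pairs` are assumptions chosen for illustration. It shows only how a DNA string becomes one token per base and how each prefix is paired with the base that follows it.

```python
# Hypothetical single-nucleotide vocabulary (an assumption for illustration;
# Evo's real vocabulary may contain additional symbols).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to exactly one integer token."""
    return [VOCAB[base] for base in seq.upper()]


def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """Build (context, target) pairs: every prefix is paired with the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = tokenize("ACGTAC")
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Because every base is its own token, a single-nucleotide change always alters exactly one token, which is the kind of resolution the article describes.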

The team was also able to reproduce this in its experiments when training Transformer models and other architectures such as Mamba. “Well, the amazing thing is that these deep signal processing architectures seem to scale better,” Poli says. “It’s not just that they can process these longer sequences and then do about as well as transformers. It’s as if they scale better per flop. They’re just better architectures, I believe, than transformers.”

Evo is a foundation model for biology

Evo was trained on a large database of 2.7 million prokaryotic genomes, a fraction of the publicly available genomic data. The model was trained in two stages. In the first stage, it was trained with a context length of 8,000 base pairs; in the second stage, the context length was increased to 131,000 base pairs. This allows the model to recognize patterns and make predictions across much longer DNA sequences than previous methods. The corresponding training dataset, OpenGenome, will be made publicly available shortly.
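The difference between the two context lengths can be illustrated with a short Python sketch. The article does not describe how the training data was actually cut into examples, so the non-overlapping windowing below, the helper `split_into_windows`, and the synthetic sequence are all assumptions. The sketch only shows how many pieces the same sequence falls into at each context length.

```python
def split_into_windows(sequence: str, context_length: int) -> list[str]:
    """Cut a sequence into consecutive, non-overlapping windows of at most context_length bases."""
    return [
        sequence[start:start + context_length]
        for start in range(0, len(sequence), context_length)
    ]


# Context lengths as reported in the article, in base pairs.
STAGE_1_CONTEXT = 8_000
STAGE_2_CONTEXT = 131_000

# Synthetic 400,000-base sequence, used purely for illustration.
genome = "ACGT" * 100_000

stage_1_windows = split_into_windows(genome, STAGE_1_CONTEXT)
stage_2_windows = split_into_windows(genome, STAGE_2_CONTEXT)
print(len(stage_1_windows), len(stage_2_windows))  # prints "50 4"
```

With the longer context, far more of the sequence is visible to the model in a single example, so relationships between distant positions fall inside one window instead of being split across many.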

Early experiments with Evo show the potential for several applications, including predicting an organism’s vital genes based on small DNA mutations. This capability could replace traditional laboratory experiments, which the team says can often take months.

Image: Nguyen, Poli, Durrant et al.

In tests, Evo was able to compete with leading protein-specific language models in predicting the effects of mutations on the function of E. coli proteins. Evo can also predict the functional properties of non-coding RNAs (ncRNAs) and infer gene expression from regulatory DNA.
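One common way to use a sequence model for mutation-effect prediction is to compare the likelihood the model assigns to the mutated sequence with the likelihood of the original. The article does not spell out Evo's exact scoring procedure, so the Python sketch below is an assumption-laden illustration: a tiny bigram model with add-one smoothing stands in for Evo, and `fit_bigram_model`, `log_likelihood`, and `mutation_score` are hypothetical names.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_bigram_model(training_seq: str) -> dict[tuple[str, str], float]:
    """Estimate P(next | prev) from bigram counts with add-one smoothing (toy stand-in for Evo)."""
    pair_counts = Counter(zip(training_seq, training_seq[1:]))
    prev_counts = Counter(training_seq[:-1])
    return {
        (prev, nxt): (pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + len(ALPHABET))
        for prev in ALPHABET
        for nxt in ALPHABET
    }


def log_likelihood(model: dict[tuple[str, str], float], seq: str) -> float:
    """Sum log P(next | prev) over the sequence; the first base is not scored."""
    return sum(math.log(model[(prev, nxt)]) for prev, nxt in zip(seq, seq[1:]))


def mutation_score(
    model: dict[tuple[str, str], float], wild_type: str, position: int, new_base: str
) -> float:
    """Log-likelihood of the mutant minus that of the wild type; negative means less plausible."""
    mutant = wild_type[:position] + new_base + wild_type[position + 1:]
    return log_likelihood(model, mutant) - log_likelihood(model, wild_type)


model = fit_bigram_model("ACGT" * 50)
wild_type = "ACGTACGTACGT"
score = mutation_score(model, wild_type, position=5, new_base="A")
print(f"score for C->A at position 5: {score:.3f}")
```

In this toy setup, the substitution breaks the repeating pattern the model was fitted on, so the score is negative. A real model would be scored the same way in principle, only with a far richer notion of what a plausible sequence looks like.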

In addition, Evo can generate complex molecular systems such as CRISPR-Cas complexes and transposable elements. It can also generate DNA sequences longer than 650 kilobases, an order of magnitude larger than previous methods. And while previous generative models typically focus on a single modality, Evo is capable of designing large functional complexes of proteins and ncRNAs.

GitHub.
