Genome Foundation Models

Genome Foundation Models | Andrew Carroll


This post discusses genome foundation models. The first part explains what a foundation model is and is meant as an introduction to the topic. The second part shares some opinions on current protein and genome language models.

What is a Foundation Model

A foundation model learns a broad set of knowledge that is useful across a wide range of tasks. The idea is that developers can build on top of that model to solve problems better and more quickly. In addition, a foundation model may itself be directly capable of some tasks, so users can also work with it directly.

The ability of a foundation model to serve either purpose is a function of the information learned by the model. What is learned by a model is an interaction between:

The architecture and training machinery of the model

The data that is shown to the model during training

The task that the training process requires the model to solve.

As a model is trained to solve a task, the process of training organizes the information in the network, and encodes knowledge that is required to solve the task effectively.

Walking through an example task

Let’s train a model on the following task: given a sequence of 9 nucleotides, report the amino acids encoded by the reading frame. For example, given ATGAGGGTT, report MRV; given TTGAGGGTT, report nothing, because the first codon is not a start codon. To solve this task, the model must learn the rules of the genetic code.
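The labelling rule for this task can be written down directly, which is handy for generating training examples. Here is a minimal sketch in Python. It assumes the standard genetic code, ATG as the only start codon, and `*` as the symbol for a stop codon. The function name `translate_9mer` is made up for illustration and is not taken from the training code behind this post.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_9mer(seq: str) -> str:
    """Return the label for a 9-nucleotide input (hypothetical helper)."""
    if len(seq) != 9:
        raise ValueError("expected exactly 9 nucleotides")
    codons = [seq[i:i + 3] for i in range(0, 9, 3)]
    if codons[0] != "ATG":
        return ""  # no start codon, so nothing is reported
    return "".join(CODON_TABLE[c] for c in codons)


print(translate_9mer("ATGAGGGTT"))  # MRV
print(translate_9mer("ATGTGGTAA"))  # MW*
```

The network never sees this table. It only sees input and label pairs, and has to recover the same mapping from examples.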

I’ve trained a simple 3-layer network to solve this problem. Training models like this involves showing a series of labelled examples and updating the weights of the model based on whether a prediction is correct. Each pass through the training data is called an “epoch”, and over the course of training the model progressively learns how to predict the examples.

Below I show how two examples, (MW-STOP) and (MAQ), are predicted over the course of the training run. The figure shows the activations of the neurons and their connections, along with the training epoch and the model’s overall accuracy on the hold-out test set.

Transferring knowledge from a foundation model to a new model

In order to solve this task, our model has to learn rules of translation - how the triplet code corresponds to amino acids. Let’s see how this information can be useful to help solve different, but related tasks.

For this task, let’s predict the hydrophobicity score of the amino acid sequence by summing the hydrophobicity of each encoded amino acid (if any). For example, for the amino acids M (+1.9), L (+3.8), and S (-0.8), we should predict 4.9. This task is a bit harder, because the model still has to learn the codon rules and then has to map the amino acids onto their properties.
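The target for this second task can also be computed directly. The sketch below assumes the Kyte-Doolittle hydropathy scale, which matches the three values quoted above but is not named in this post. It also assumes that stop codons contribute zero and that inputs without a start codon score zero. The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values (assumed scale, see the note above).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_9mer(seq: str) -> float:
    """Sum the hydropathy of the encoded amino acids (hypothetical helper)."""
    if len(seq) != 9:
        raise ValueError("expected exactly 9 nucleotides")
    codons = [seq[i:i + 3] for i in range(0, 9, 3)]
    if codons[0] != "ATG":
        return 0.0  # assumption: nothing is encoded without a start codon
    # Stop codons are not in HYDROPATHY and contribute 0.0 (assumption).
    return sum(HYDROPATHY.get(CODON_TABLE[c], 0.0) for c in codons)


# ATG CTG TCT encodes M, L, S
print(round(hydrophobicity_9mer("ATGCTGTCT"), 1))  # 4.9
```

Written this way, the two-step structure is explicit: first decode the codons, then look up a property for each amino acid. A from-scratch network has to discover both steps at once.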

Here, I’ll show you two different networks. In the first network, on the left, we train the model “from scratch”, meaning directly on the nucleotide data. In the second network, on the right, we take the connections learned by the network from the first task and use them as a layer of pre-trained embeddings. The concept of embeddings is important: embeddings represent, either directly or in a compressed form, the information learned in some part of the network. Our new network learns from those embeddings how to solve the new task.

Like before, I’ll show you the network for two examples. Notice how the model on the right with the embeddings learns to solve the problem faster and with an overall better accuracy.
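The idea of reusing pre-trained embeddings can be sketched in a few lines of plain Python. This is not the network from this post. As a loudly labelled assumption, the "pre-trained" layer is replaced by a hard-coded stand-in, `frozen_embedding`, that returns counts of the decoded amino acids, which is the kind of information the first task forces a network to learn. Only a small linear head is trained on top of it. The hydropathy values are again the assumed Kyte-Doolittle scale, and all names are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYMBOLS = sorted(set(AMINO_ACIDS))  # 20 amino acids plus "*" for stop
INDEX = {s: i for i, s in enumerate(SYMBOLS)}
HYDROPATHY = {  # Kyte-Doolittle values (assumed scale), used only to build targets
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def decode(seq):
    """Decoded symbols for a 9-mer, or an empty list if there is no start codon."""
    if seq[:3] != "ATG":
        return []
    return [CODON_TABLE[seq[i:i + 3]] for i in range(0, 9, 3)]


def frozen_embedding(seq):
    """Stand-in for the pre-trained layer (assumption): counts per decoded symbol."""
    vec = [0.0] * len(SYMBOLS)
    for symbol in decode(seq):
        vec[INDEX[symbol]] += 1.0
    return vec


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def train_head(data, epochs=50, lr=0.05):
    """Plain SGD on a linear head; the embedding itself is never updated."""
    weights = [0.0] * len(SYMBOLS)
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights


rng = random.Random(0)
seqs = ["ATG" + "".join(rng.choice(BASES) for _ in range(6)) for _ in range(200)]
data = [(frozen_embedding(s), sum(HYDROPATHY.get(a, 0.0) for a in decode(s)))
        for s in seqs]
head = train_head(data)
print("MSE before training:", mse([0.0] * len(SYMBOLS), data))
print("MSE after training: ", mse(head, data))
```

Because the embedding already separates the amino acids, the head only has to learn roughly one weight per symbol. A network that starts from raw nucleotides would have to learn the codon rules as well, which is the extra work the from-scratch model does.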

What are the advantages of the foundation model approach

This approach is most useful when your ability to encode knowledge into the network is limited compared to what was used to train the foundation model. Since that knowledge is a function of training data and task, here are the most common places that would be the case:

When you are solving a narrower task. Although training on diverse tasks yields useful general knowledge, you prefer to focus your own training setup on the task you care about.

When you are data limited. You may have only a handful of examples in your domain and be unable to train a model with broad knowledge.

When you are compute constrained, and want to use your resources more efficiently.

To illustrate these advantages, we can look at some training curves. The plot below shows the evaluation accuracy when training a model on 50 examples. The blue curve is the model with pre-trained embeddings. You can see how it rapidly focuses learning on the hydrophobicity problem, while the from-scratch model, in red, has to learn both aspects of the problem. To show the importance of the prior knowledge, for the orange curve I took the same architecture as the model with embeddings but randomized the weights, erasing the information in the network. Without the prior information, this model does not get the performance boost, even though its architecture is identical.

To...
