Intuitive Introduction to ProteinGuide

This post provides an intuitive introduction to protein sequence generative models and ProteinGuide. This is part of our series of resources for using ProteinGuide; you can find the index page for this series here.

Margin note: For readers with a statistics or machine learning background, we provide some formal definitions and equations in the margins, but feel free to skip these.

ProteinGuide uses a predictive model of protein fitness to guide a pretrained sequence generative model. If you're unfamiliar with any of these terms, they will be explained below. At a high level, generative models design sequences, predictive models tell us which ones will work, and ProteinGuide gives a recipe for how to combine them to get high fitness sequences in a statistically sound way.

Margin note: Specifically, ProteinGuide is a form of 'classifier guidance' for discrete generative models.

Generative model diagram

To understand how ProteinGuide works, we first need to understand how generative models operate. From there, we will layer in how predictive models enable guidance. Finally, we will discuss how the predictive model is trained and what the end-to-end ProteinGuide algorithm looks like.

Discrete Generative Models in a Nutshell

Generative models are trained to "fill-in-the-blank". As input, they take protein sequences where some positions are masked and they predict the missing amino acids. This allows them to design sequences by iteratively inserting amino acids into the masked positions, until the sequence is fully unmasked.

An important detail is that they don't directly output the amino acid to be inserted. Instead, they look at all the 20 amino acids and assign each of them a probability of being the correct amino acid for that position. We then look at these probabilities and use them to randomly sample an amino acid to actually insert.
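As a rough illustration of this sample-then-insert loop, here is a minimal sketch. Everything in it is an assumption made for illustration: `toy_model` is a hypothetical stand-in that returns uniform probabilities (a real generative model would return learned ones), `_` is used as the mask symbol, and the starting sequence is arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "_"  # stand-in mask symbol for this sketch


def toy_model(sequence, position):
    """Hypothetical stand-in for a trained model: one probability per
    amino acid for the masked position (here simply uniform)."""
    return [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)


def generate(sequence, model, rng):
    """Fill the masked positions one at a time, in a random order."""
    seq = list(sequence)
    masked = [i for i, aa in enumerate(seq) if aa == MASK]
    rng.shuffle(masked)
    for i in masked:
        probs = model(seq, i)
        # Sample an amino acid in proportion to its predicted probability.
        seq[i] = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return "".join(seq)


rng = random.Random(0)
# Eight designs from the same partially masked starting point.
designs = [generate("MK____LV", toy_model, rng) for _ in range(8)]
```

The key line is the call to `rng.choices`: the model only supplies probabilities, and the actual amino acid is drawn at random from them.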

Timeseries of the Masked Generation Process

Margin note: When we talk about generative models here, we're referring to any-order autoregressive models (or equivalently masked language, discrete diffusion, or discrete flow matching models). We model a protein as a sequence $x$ of amino acids $\mathcal{A}$ of length $L$ (i.e. $x \in \mathcal{A}^L$). Let $M(x, t):(\mathcal{A}^L \times [0, 1]) \rightarrow (\mathcal{A} \cup \{\text{<mask>}\})^L$ be a function that randomly masks each position in $x$ with probability $t$. The generative model, typically parameterized by a decoder-only transformer, estimates the marginal distribution $P(x_i|M(x, t))$ at every masked position $i$, conditioned on a partially masked input sequence. We denote the distribution induced by the generative model's predictions $P_\theta$.

Because each amino acid is sampled at random, the generative model is not deterministic. Below is a video showing ESMC generating 8 different sequences in parallel. Observe that even though they all start with the same masked sequence, each sequence ends up being different.
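For readers who prefer code to notation, the masking function $M(x, t)$ defined above can be sketched in a few lines. This is only an illustration: the function name `mask_sequence`, the `<mask>` token string, and the example sequence are assumptions, not part of any particular library.

```python
import random

MASK = "<mask>"


def mask_sequence(x, t, rng):
    """M(x, t): mask each position of x independently with probability t."""
    return [MASK if rng.random() < t else aa for aa in x]


rng = random.Random(0)
x = "MKTAYIAKQR"  # arbitrary example sequence
fully_visible = mask_sequence(x, 0.0, rng)  # t = 0: nothing is masked
half_masked = mask_sequence(x, 0.5, rng)    # t = 0.5: about half is masked
fully_masked = mask_sequence(x, 1.0, rng)   # t = 1: everything is masked
```

Generation runs this process in reverse: it starts from a heavily masked sequence and fills positions in until nothing is masked.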

This non-determinism is a very important property. It is exactly what allows us to use the generative model to create a library of diverse proteins. Using a generative model, the chance that we produce a given variant is proportional to how good the model thinks it is. But how does the model decide? Above I said that the model

look[s] at all the 20 amino acids and assign[s] each of them a probability of being the correct amino acid for that position.

So what is this "probability of being correct"?

Imagine that we roll out all possible sequences that the model could generate from a given partially masked sequence. To keep things simple let's assume our "protein" is three amino acids long, is only made up of Leucine (L) and Tryptophan (W), and that we're unmasking it from left to right. We can now map out all the possible sequences we could generate as a tree. At each node, the model must decide which amino acid to add next. Each choice it makes moves it down the tree until it reaches a complete sequence at the end.

For our toy problem, we'll say that any sequence with more Ws than Ls is a "real" protein. We'll label each of these with a blue checkmark.

Our goal is to create a generative model that is trained to assign all real proteins equal probability. It turns out that it is really simple to define this ideal model. Consider the node circled in red, where the first position has already been filled with a W. All the model has to do is look downstream of each choice it might make and count how many real proteins it can generate from that point. Then it takes each of those counts and divides by the total number of real proteins reachable from the current node. If the model adds an L (left-hand branch in the tree), there is only one real protein it can make, which is WLW. If it adds a W, there are two real proteins it can make: WWL and WWW. Therefore, the model assigns a probability of $1/3$ to inserting an L and $2/3$ to inserting a W.

Margin note: We want equal probability because this generative model is not for any specific task.

Margin note: The generative model is typically trained using an ELBO and can be sampled from in a variety of ways. In the main text, we describe sampling from the model using an any-order autoregressive procedure.
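The counting argument is small enough to check by brute force. The sketch below enumerates the toy tree (length 3, alphabet L and W, left-to-right unmasking) and reproduces the $1/3$ and $2/3$ from the text. The function names are made up for this example, and a real model learns these probabilities from data rather than enumerating sequences.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "LW"
LENGTH = 3


def is_real(seq):
    """Toy rule from the text: more Ws than Ls means a 'real' protein."""
    return seq.count("W") > seq.count("L")


def count_real(prefix):
    """Number of real proteins reachable from a left-to-right prefix."""
    tails = product(ALPHABET, repeat=LENGTH - len(prefix))
    return sum(is_real(prefix + "".join(tail)) for tail in tails)


def ideal_generative(prefix):
    """Probability of each next amino acid under the ideal model.
    Only defined for prefixes with at least one real completion."""
    counts = {aa: count_real(prefix + aa) for aa in ALPHABET}
    total = sum(counts.values())
    return {aa: Fraction(c, total) for aa, c in counts.items()}


# The circled node: the first position is already W.
probs = ideal_generative("W")
```

Here `probs["L"]` is `Fraction(1, 3)` and `probs["W"]` is `Fraction(2, 3)`, matching the tree.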

Conditional Generation using Predictive Models

Now let's say, for our design task, we want a protein that has at least one L. Of course, the protein has to be "real", so this requirement stacks on top of the first. We can mark the "real" sequences that have at least one L with an orange checkmark, and box the resulting high fitness sequences in purple.

Tree of possible sequences with high fitness sequences boxed

In the same way that the generative model picks out the "realistic" proteins in blue, the model that picks out our high fitness proteins is referred to as the property-conditioned generative model or conditional generative model. Note that this conditional generative model predicts different probabilities for selecting each branch at the circled node. Compared to the regular unconditional generative model, the probability of adding a W went down from $2/3$ to $1/2$ under the conditional generative model, and the probability of adding an L went up from $1/3$ to $1/2$.

Margin note: Conditional generative model meaning a generative model that takes into account additional information about the desired protein at generation time. In this case, it is 'conditioning' on the fact that we want at least one L (it builds character, lol).
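To see where the $1/2$ comes from, here is the count at the circled node written out. This assumes, as in the earlier example, that the circled node is the one where the first position is already W. The only high fitness sequences (real, with at least one L) downstream of that node are WLW and WWL, one on each branch:

$$P(\text{add L} \mid \text{W}, \text{high fitness}) = \frac{|\{\text{WLW}\}|}{|\{\text{WLW}, \text{WWL}\}|} = \frac{1}{2}, \qquad P(\text{add W} \mid \text{W}, \text{high fitness}) = \frac{|\{\text{WWL}\}|}{|\{\text{WLW}, \text{WWL}\}|} = \frac{1}{2}.$$

WWW drops out of the count because it has no L, which is exactly why the probability of adding a W falls.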

In ProteinGuide, we want to generate sequences as if we had trained a conditional generative model (a task-specific model tuned to your design goal), but without actually training a new generative model from scratch. The way we do this is to first train a predictive model that predicts how likely the generative model is to create a high fitness sequence starting from a given partially masked sequence. In other words, it counts the proportion of sequences with the blue checkmarks that also have orange ones. We then use this model during sampling to adjust the generative model's predictions for which amino acid to insert in such a way that the resulting probabilities become the same as a conditional generative model for our data. That is, we implicitly construct a conditional generative model using the predictive model and generative model. We refer to this as the guided generative model to contrast it conceptually with having to train a genuine conditional sequence generative model for your task.

Margin note: WLOG, for a partially masked sequence $x^{(t)}=M(x, t)$, the predictive model estimates the probability of a property $y=f(x)$ of the sequence being above some threshold, $p(y>\tau|x^{(t)})$. Here, $f$ is the "true" sequence-function relationship that we seek to optimize.

Generative model diagram

In ProteinGuide, the guided predictions for the next amino acid come from multiplying the predictive model's probability of getting a high fitness variant by the generative model's probability of that amino acid resulting in a realistic protein, and then renormalizing.

Margin note: Formally, $\color{purple}{p(x|y)} \color{black}{\propto} \color{orange}{p(y|x)} \color{black}{\cdot} \color{blue}{p(x)}$, where $\color{blue}{p(x)}$ is the distribution of sequences induced by the pretrained generative model, $\color{orange}{p(y|x)}$ is the distribution of fitness conditioned on sequence as estimated by the predictive model, and $\color{purple}{p(x|y)}$ is the distribution of sequences conditioned on fitness that we want to sample from. Note that at each generation step, we are only considering 20 amino acids, so we can compute the normalizing factor above directly.

Informally, we can write this as:

$$\color{purple}{\text{Guided Model}} \color{black}{=} \color{darkorange}{\text{Predictive Model}} \color{black}{\times} \color{blue}{\text{Generative Model}}\color{black}{.}$$
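Here is that multiply-and-renormalize step applied to the toy tree at the circled node. The numbers are read off the example rather than produced by trained models: the generative probabilities are the $1/3$ and $2/3$ from earlier, and the predictive values are the share of real proteins downstream of each choice that contain an L (after adding L, only WLW remains, so $1$; after adding W, WWL and WWW remain, so $1/2$). The function name `guide` is made up for this sketch.

```python
from fractions import Fraction

# Unconditional generative model at the circled node (first position is W).
generative = {"L": Fraction(1, 3), "W": Fraction(2, 3)}

# Predictive model evaluated after each candidate insertion: the fraction of
# downstream real proteins that have at least one L.
predictive = {"L": Fraction(1, 1), "W": Fraction(1, 2)}


def guide(generative, predictive):
    """Multiply the two models per amino acid, then renormalize."""
    unnormalized = {aa: predictive[aa] * generative[aa] for aa in generative}
    z = sum(unnormalized.values())  # small sum, so computed directly
    return {aa: w / z for aa, w in unnormalized.items()}


guided = guide(generative, predictive)
```

Both entries of `guided` come out to `Fraction(1, 2)`, the same probabilities the conditional generative model assigns at that node.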

Recall that each of these models is really solving a specific counting problem. Each model counts how many sequences downstream of the node in our tree corresponding to the current masked sequence satisfy particular conditions; the conditions differ between the models. It turns out that each of these counting problems can be expressed as a fraction, allowing us to see that, under the hood, ProteinGuide is actually doing a very simple, intuitive operation:

$$\color{purple}{\frac{\text{# high fitness}}{\text{# possible sequences}}} \color{black}{=} \color{darkorange}{\frac{\text{# high fitness proteins}}{\text{# proteins}}} \color{black}{\times} \color{blue}{\frac{\text{# proteins}}{\text{# possible sequences}}}\color{black}{.}$$

This ensures that we only generate high fitness variants from the sequence landscape.
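As a sanity check, we can plug the toy tree into these fractions for the two choices at the circled node (first position already W). Adding a W leaves two possible sequences (WWL and WWW), both of which are proteins and one of which is high fitness. Adding an L also leaves two possible sequences (WLL and WLW), of which only WLW is a protein, and it is high fitness:

$$\text{add W:}\quad \color{purple}{\frac{1}{2}} \color{black}{=} \color{darkorange}{\frac{1}{2}} \color{black}{\times} \color{blue}{\frac{2}{2}}\color{black}{,} \qquad \text{add L:}\quad \color{purple}{\frac{1}{2}} \color{black}{=} \color{darkorange}{\frac{1}{1}} \color{black}{\times} \color{blue}{\frac{1}{2}}\color{black}{.}$$

Renormalizing the blue terms over the two branches recovers the unconditional $2/3$ and $1/3$, while renormalizing the purple terms gives $1/2$ each (they already sum to one here).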

Training the Predictive Model

So how do you train the predictive model? It's pretty simple to train one that takes in completed sequences and outputs a predicted functional value, like fluorescence or catalytic activity. The predictive model used in guidance has one twist. In order to guide the generative model, it needs to predict the probability of success (i.e. satisfying our fitness criteria) for a partially masked sequence. But what does that mean? A partially masked sequence doesn't have a real fitness.

As we saw in the tree diagrams above, when the predictive model estimates how good a partially masked sequence is, it is predicting the probability that a randomly sampled sequence the generative model creates starting from this partially masked sequence will have high fitness.

Now this sounds pretty complicated, but it can be achieved with a relatively simple data collection and training procedure. First, collect assay-labeled data for sequences generated by your unconditional generative model. Then set up the predictive model training like any other regression or classification problem where it needs to predict the fitness from the sequence. The only thing you have to change is that every time you sample a sequence to train on, randomly mask out some of the positions. What this does is start with the completed sequence and then randomly sample one of its parent nodes in the tree. Over many samples the model will learn that many final sequences share certain parents and it will learn to average their fitness at that parent node accordingly.

Margin note: If this is not possible, for whatever reason, you can still use ProteinGuide, but there are some nuances you should be aware of. These considerations are covered in detail in my post on ProteinGuide in Practice.
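To make the averaging effect concrete, here is a minimal sketch on the toy problem. It makes several simplifying assumptions: the "assay-labeled library" is just the four toy real proteins labeled 1 if they contain an L, masking only removes a suffix so that each masked input corresponds to a node in the left-to-right tree, and the "predictive model" is a lookup table of average labels rather than a neural network. All names are hypothetical.

```python
import random
from collections import defaultdict

MASK = "_"
# Hypothetical assay-labeled library: the toy real proteins, labeled 1 if
# they are high fitness (contain at least one L) and 0 otherwise.
LIBRARY = {"WWW": 0, "WWL": 1, "WLW": 1, "LWW": 1}


def random_parent(seq, rng):
    """Mask a random-length suffix, i.e. pick a random parent node of seq."""
    keep = rng.randrange(len(seq) + 1)
    return seq[:keep] + MASK * (len(seq) - keep)


def train(library, n_samples, rng):
    """Tabular 'predictive model': the average label seen at each masked input."""
    totals, counts = defaultdict(float), defaultdict(int)
    sequences = list(library)
    for _ in range(n_samples):
        seq = rng.choice(sequences)
        parent = random_parent(seq, rng)  # fresh random mask on every draw
        totals[parent] += library[seq]
        counts[parent] += 1
    return {p: totals[p] / counts[p] for p in counts}


rng = random.Random(0)
predictor = train(LIBRARY, 20000, rng)
```

After training, `predictor["W__"]` lands close to $2/3$ and `predictor["WW_"]` close to $1/2$, while `predictor["WL_"]` is exactly 1 because WLW is the only library sequence under that node. A neural network trained with randomly masked inputs is doing a smoothed version of the same averaging.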

The Basic ProteinGuide Algorithm

Putting it all together, the way ProteinGuide works is:

1. Pick a sequence generative model you like (including inverse folding models).
2. Select a region you want to design and mask it out.
3. Sample a first library of sequences from the generative model and assay them for your property of interest.
4. Train a predictive model to estimate the fitness of the sequences in your first library from their partially masked versions.
5. Guide your pretrained generative model using the predictive model to get a second library.
6. Assay the second library and repeat steps 4-6 until you have a library of satisfactory fitness!
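The guided sampling step can be sketched end to end on the toy problem from earlier. This is a simplified illustration, not the ProteinGuide implementation: both models are computed exactly by enumerating the toy tree, so the assay and training steps are replaced by exact counting, and unmasking is left to right. All function names are made up for this example.

```python
import random
from itertools import product

ALPHABET, LENGTH = "LW", 3


def completions(prefix):
    """All full-length sequences reachable from a left-to-right prefix."""
    tails = product(ALPHABET, repeat=LENGTH - len(prefix))
    return [prefix + "".join(t) for t in tails]


def is_real(seq):
    return seq.count("W") > seq.count("L")


def is_high_fitness(seq):
    return "L" in seq


def generative(prefix):
    """Ideal unconditional model: share of downstream real proteins per choice."""
    counts = {aa: sum(map(is_real, completions(prefix + aa))) for aa in ALPHABET}
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def predictive(prefix):
    """Share of real proteins downstream of prefix that are high fitness."""
    real = [s for s in completions(prefix) if is_real(s)]
    return sum(map(is_high_fitness, real)) / len(real) if real else 0.0


def guided_sample(rng):
    """Unmask left to right, weighting each choice by predictive x generative."""
    prefix = ""
    while len(prefix) < LENGTH:
        gen = generative(prefix)
        weights = [predictive(prefix + aa) * gen[aa] for aa in ALPHABET]
        # rng.choices renormalizes the weights for us.
        prefix += rng.choices(ALPHABET, weights=weights, k=1)[0]
    return prefix


rng = random.Random(0)
library = [guided_sample(rng) for _ in range(200)]
```

Every sequence in `library` is one of LWW, WLW, or WWL: real proteins with at least one L. With trained models the guarantee is only as good as the predictive model, which is why the loop of assaying and retraining matters in practice.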