ProteinGuide in Practice

This post assumes some basic familiarity with generative models, training regression models, Bayes' rule, and ProteinGuide itself (for a brief introduction, check out my Intuitive Introduction to ProteinGuide). It is intended for those with some background in machine learning who are interested in applying ProteinGuide in their research.

If you're interested in generative modeling, I post pieces that synthesize my learnings about this field on my substack. Two upcoming essays that I'm particularly excited about will investigate the origins of the probabilistic framework for generative modeling. I'll not only trace through the history of generative models in machine learning (Diffusion, GPT, GANs, VAEs, Deep Belief Networks, etc.) but will also examine the intellectual precursors of the field, including topics like the invention of Monte Carlo methods to create the atomic bomb and the invention of Markov chains to analyze Russian literature. Subscribe to my substack to get notified when these pieces come out.

What follows is a living document recording the Listgarten lab's current best practices for using ProteinGuide. The post is divided into three sections. The first section describes a conceptual framework for how ProteinGuide works in the ideal case. The second uses this framework to categorize the main failure modes you might encounter and covers how to address them. The final section organizes these ideas into a reasonable default workflow for using ProteinGuide in the lab.

Understanding ProteinGuide and its Vulnerabilities

ProteinGuide samples from a function-conditioned distribution over proteins using classifier guidance. The key equation to remember is the following application of Bayes' rule:

$$p(x \mid y) \propto p(y \mid x) \cdot p(x).$$

In this equation,

  • $p(x)$ is your prior distribution over sequences for the task. In ProteinGuide, this is instantiated as a pretrained sequence generative model. (Strictly speaking, the generative model predicts the marginal probability of the next position to unmask $x_i$ given a partially masked sequence $x^{(t)}$, but we'll use $x$ as shorthand in this post.)
  • $p(y \mid x)$ is your predictive model of protein function. It estimates the probability that a sequence $x$ can achieve your function of interest $y$. When the function of interest is a discrete variable, like a binary label for whether a sequence adopts a particular fold, this notation suffices and the predictive model is a classifier. When the function of interest is a continuous variable, such as binding affinity, $p(y \mid x)$ is shorthand for $p(y > \tau \mid x)$, where $\tau$ is the threshold for function that you want to achieve with your library. In that case the predictive model is a regression model, normally set up to parameterize a tractable distribution over function values (e.g. predicting the mean and variance of a Gaussian). The threshold $\tau$ can then be applied later, when actually sampling, to get a probability.
  • $p(x \mid y)$ is what we actually want to sample from: plausible sequences that solve the task.
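
For a continuous property, the step from a regression output to $p(y > \tau \mid x)$ is short enough to write out. The following is a minimal sketch, assuming the predictive model outputs the mean and standard deviation of a Gaussian; the function name `prob_exceeds_threshold` is illustrative and not part of any ProteinGuide or ProteinGen API.

```python
import math


def prob_exceeds_threshold(mean: float, std: float, tau: float) -> float:
    """Return p(y > tau) for y ~ Normal(mean, std**2)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (tau - mean) / std
    # Gaussian survival function: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical prediction: mean 2.0, std 1.0, threshold 0.0
p_success = prob_exceeds_threshold(mean=2.0, std=1.0, tau=0.0)
```

Because the threshold enters only at this last step, the same trained regression model can be reused for several values of $\tau$.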

For more background on ProteinGuide, check out the overview on the homepage of this series of posts.

Now let's develop a conceptual framework for understanding how ProteinGuide should work in the ideal case. Consider the following schematic of the sequence space for a given design problem.

Sequences with a blue check are those that are predicted to be realistic by the generative model, and sequences with an orange check are those that are predicted to solve the task by the predictive model. The region enclosed by the dashed red line is the region of sequence space we think is relevant for our task.

In an ideal scenario, the sequence-function data from our first library is concentrated on the intersection of these two. This ensures that:

  1. we are likely to find good sequences in the next rounds,
  2. the predictive model is well-trained on the same region of sequence space that the generative model concentrates on, and
  3. the generative model doesn't have too much probability mass outside of the region that we're interested in sampling from.

When these conditions hold, ProteinGuide works well. After all, the mathematical machinery underlying the technique stems from the basic axioms of probability. If we're having trouble with ProteinGuide, it won't be because math decided to break that day. It will be because our computational pipeline doesn't adhere to our mathematical assumptions.

These assumptions are the same ones we discussed in the overview to this series of posts:

  1. the generative model must capture your prior beliefs about which sequences make sense for this task, and
  2. the predictive model must accurately determine, based on your wet-lab data, which sequences from your prior are most desirable.

In our experience, every time we thought we found a "new" failure mode for ProteinGuide, it turned out to stem from a violation of one or both of these assumptions:

  1. your generative model, as it is set up, doesn't capture all your prior knowledge about the task and often produces sequences that are clearly suboptimal for it, or
  2. the sequences produced by the generative model are out of distribution for the predictive model; equivalently, the generative model produces sequences that are very different from those in your assay labeled data.

The remainder of this post covers tips and tricks to mitigate these issues. To begin, there are two possible scenarios we must analyze.

Scenario 1: No wet-lab data yet

If you haven't run your first experimental library yet, congratulations! You can avoid these problems pretty safely (see the workflow section at the end of this post, which includes running the first library). Just generate the sequences for your first library using the generative model. The main thing you need to get right is to set up the sampling problem to bake in as much of your prior knowledge about the task as possible. Don't just use the unconditional generative model directly. Play with:

  • picking a specific region of the protein to design
  • biasing the generative model towards wildtype sequences
  • filtering the generations with pLDDT or refolding metrics
  • guiding with the stability predictor from our paper.

Then, to sample the second round of designs, set up the generation pipeline the same way. Just take care to plug in the guided model wherever you used the pretrained model the first time around. (If you use the ProteinGen package, this corresponds to, e.g., changing ESM3 to DEG(ESM3, predictive_model) wherever it appears in your code.)

The only caveat is that if you filter generated sequences (for example, by selecting sequences for the library by pLDDT), you may need to finetune the generative model on the library sequences first. Otherwise the generative process will be out of distribution (OOD) for the predictive model, since it will no longer be biased towards high-pLDDT sequences like the library was.

The benefit of setting up your first library this way is that the sequences in it will be sampled from the distribution of your generative model. Recall from the Intuitive Introduction to ProteinGuide that the predictive model has to approximately forecast how the particular generative model it is guiding will complete the current partially masked sequence and what distribution over function values that will induce. Making the predictive model's data distribution match the generative model sequences it will take as input during guidance will make ProteinGuide much more likely to work for you.

The only downside is that a simple baseline for generating the initial library, like a DMS scan of the WT, might still outperform the generative model on average. This means even if ProteinGuide works, it is starting from a disadvantage compared to wildtype. We don't have a ton of anecdotal evidence on this question yet, but if you have the budget for multiple rounds of design, ProteinGuide starting with the generative model sequences should be fine. If you have a more limited budget, we would recommend first finetuning the generative model on the WT or successful sequences from a pilot mutational library before using it to generate your first library.

Scenario 2: Wet-lab data collected without the pretrained generative model

On the other hand, if you already have some data collected, don't worry, that's what the rest of this guide is for. We'll walk you through how to reason about setting up ProteinGuide to work for your use-case. The section on Library-Generative Model Mismatch will be particularly relevant.

Common Problems

"Reward Hacking"

In this scenario, the predictive model does not know about some sequence bias in your dataset that is a necessary precondition for success. This commonly includes bias to the wild-type sequence. The predictive model might not know that sequences in other protein families cannot solve your task. We can see this in the schematic as the presence of false positives outside the dashed red line.

This is normally fine if your generative model constrains the space of relevant sequences strongly enough. But if the generative model is too flexible, it might produce "successes" outside the reasonable region that are unlikely to work. These are the circled points in the following diagram.

Solutions:

  • Restrict generation to only allow the model to design a small region of the protein
  • Finetune the generative model on homologs
  • Finetune the generative model on the first library

Common gotchas:

  • If you're too aggressive with these techniques, guidance might not do anything, because the sequences from the generative model no longer cover a large enough range of function for the predictive model to guide towards the good ones. You can always try to increase guidance temperature to solve this, but we've found that it's better to give the model more residues to design or to increase regularization towards the original weights when finetuning (i.e., using a lower LoRA rank).
  • If you finetune on homologs that are in-distribution for the model, the model is unlikely to change how it ranks them. It will just make your protein family of interest more likely than all other protein families. This is sufficient to address reward hacking, but it falls short of a common assumption people make when they train on homologs: that the model will also learn new fitness information. We have not found the latter to be a reliable phenomenon.
  • If you train on the initial library, there is a chance that the pretrained model overfits and forgets information it knows about the natural distribution of sequences. To guard against this, use a low LoRA rank and monitor the model's performance on a held-out set of natural sequences. This is related to the first problem above of having too little variance in the generative model's sequences.

Library-Generative Model Mismatch

In this scenario, your first experimental library was not created by the generative model. Consequently, there is a mismatch between the predictive model's training distribution and the sequences the generative model will actually create. Similarly, the guided sequences might end up being too different from your training data.

When this happens, the predictive model's behavior on the generative model's distribution becomes unpredictable. You can usually spot this failure mode by looking for very volatile $p(y \mid x_t)$ curves during the denoising process, or by noticing large disagreements between predictive models trained only on unmasked sequences and the predictive model used for guidance when they score the guided output sequences.
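
One way to make "volatile" concrete is to measure how far the predicted success probability jumps between consecutive denoising steps. The sketch below is a hypothetical heuristic, not a metric from the ProteinGuide paper: it assumes you have logged $p(y \mid x_t)$ at each step of a single trajectory, and `trajectory_volatility` is a name made up for illustration.

```python
import numpy as np


def trajectory_volatility(probs, eps=1e-6):
    """Mean absolute step-to-step change in the log-odds of p(y | x_t)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    if p.size < 2:
        raise ValueError("need at least two denoising steps")
    # Work in log-odds so that jumps near 0 or 1 are not squashed.
    logits = np.log(p) - np.log1p(-p)
    return float(np.mean(np.abs(np.diff(logits))))


# Illustrative, made-up trajectories (not real data).
smooth = [0.10, 0.15, 0.22, 0.30, 0.41, 0.55]
jumpy = [0.10, 0.80, 0.05, 0.90, 0.20, 0.55]
smooth_score = trajectory_volatility(smooth)
jumpy_score = trajectory_volatility(jumpy)
```

Comparing this score between guided trajectories and trajectories scored on sequences from your labeled library gives a rough reference point for what counts as unusually volatile.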

Solutions:

  • Employ some form of uncertainty quantification while figuring out the best approach. We've found that predicting per-sample variance and creating an ensemble of these predictors trained on different subsets of the data is helpful for quantifying your aleatoric and epistemic uncertainty. Train on replicate data directly if possible.
  • Finetune the generative model on the successful sequences from the first library, provided that the best sequences in that initial library are actually good enough to build on top of. Check that samples from this model now have lower uncertainty and higher fitness under the predictive model.
  • Guide the pretrained model as is, but make sure to bring its sequences in-distribution for the noisy predictor. This can be accomplished by self-distillation of the predictive model on partially masked sequences from your generative model. In this setup, the predictive model will learn to predict the labels it gave the unmasked sequences using the masked inputs. If you've created an ensemble to get uncertainty estimates, sample several labels for the predictive model to train on.
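
The uncertainty bookkeeping behind the first and third solutions can be sketched as follows. This assumes each ensemble member outputs a predicted mean and a predicted variance per sequence and uses the standard law-of-total-variance split; the function names are illustrative and not part of ProteinGen.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split ensemble uncertainty per sequence.

    means, variances: arrays of shape (n_members, n_sequences) holding each
    ensemble member's predicted mean and predicted variance.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    aleatoric = variances.mean(axis=0)  # average predicted assay noise
    epistemic = means.var(axis=0)       # disagreement between members
    return aleatoric, epistemic, aleatoric + epistemic


def sample_labels(means, variances, n_samples, rng):
    """Draw labels from the ensemble's mixture-of-Gaussians prediction."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n_members, n_sequences = means.shape
    # Pick a random ensemble member for every (sample, sequence) pair.
    member = rng.integers(n_members, size=(n_samples, n_sequences))
    cols = np.arange(n_sequences)
    mu = means[member, cols]
    sigma = np.sqrt(variances[member, cols])
    return rng.normal(mu, sigma)
```

Sampling a member first and then sampling from its Gaussian means the distillation labels reflect both sources of uncertainty rather than only the ensemble mean.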

Predictive Models on Masked Sequences Can Be Very Biased

In this scenario, we must confront the reality that noisy classification is a computationally hard task. To solve it fully, the predictive model has to implicitly predict the generative model's completions for each partially masked sequence.

Because this is so difficult, predictive models will often fail by having their predictions collapse onto the population mean, or by having the overall variance of their predictions contract significantly.
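
A quick check for this kind of collapse is to compare the spread of the masked-input predictions with the spread of the clean-input predictions on the same sequences. The sketch below is a hypothetical diagnostic with made-up data. Some contraction at high masking rates is expected, since less of the sequence is visible, so the ratio is most useful when tracked across masking rates.

```python
import numpy as np


def variance_contraction(clean_preds, masked_preds):
    """Ratio of prediction spread on masked vs. unmasked inputs.

    Values far below 1 suggest the masked-input predictor is regressing
    toward the population mean.
    """
    clean = np.asarray(clean_preds, dtype=float)
    masked = np.asarray(masked_preds, dtype=float)
    clean_std = clean.std()
    if clean_std == 0:
        raise ValueError("clean predictions have zero spread")
    return float(masked.std() / clean_std)


# Synthetic illustration: masked predictions shrunk 5x toward the mean.
rng = np.random.default_rng(0)
clean = rng.normal(loc=1.0, scale=2.0, size=1000)
masked = clean.mean() + 0.2 * (clean - clean.mean())
ratio = variance_contraction(clean, masked)
```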

Solutions:

  • Initialize the noisy predictive model with the clean model's weights. Then, freeze the weights of the noisy predictive model that take non-masked features as input. Only allow the weights for masked inputs and hidden/output layers to be updated during training. This way, the noisy predictive model is unlikely to artificially decrease MSE at high masking rates by simply learning a bias towards the population mean.

Common gotchas:

  • If you try to solve this problem by covering all noise levels with multiple separately trained predictors, differences in how they model the denoising process will often cause them to guide the pretrained model in conflicting ways, killing the success rate.

Generative Model Doesn't Have Coverage

This is a more theoretical issue. There is a small chance that you somehow figure out there is a specific region of sequence space that is very likely to solve your task, but for which the generative model does not assign very high probability.

Solutions:

  • Finetune your generative model on algorithmically generated or heavily filtered sequences from that specific region. Changing the predictive model instead might be tricky, since you likely do not want to destroy the correct function-related features it already recognizes.

A Practical Workflow for ProteinGuide

Putting everything together, the following is a simple default workflow for using ProteinGuide in the lab. For further implementation details, check out the ProteinGuide Python Workflow on ProteinGen.

  1. If possible, generate your first library using the pretrained generative model. Otherwise, finetune the model on a subset of good sequences from your first library.
  2. In either case, make sure to bake in as much of your prior knowledge about the task as possible into the generative model. This includes inverse folding and filtering with AlphaFold whenever possible.
  3. Try several predictive model architectures by training them on just the unmasked sequences. At this point, focus on the basic prediction task without uncertainty quantification. IMPORTANT: until step 8, always train two sets of predictive models: an oracle that sees all the data, and a regular model that doesn't see the test split.
  4. Using the best one, introduce uncertainty quantification.
    • Create an ensemble of mean predictors using random train-splits from your first library.
    • Use these ensembles to score their validation splits and use those predictions to train a variance prediction network.
    • Try parameterizing a Cholesky decomposition of the covariance matrix to get a full covariance estimate for your predictions if you have enough data.
  5. Train a predictive model for masked sequences, being careful to avoid collapse of the mean predictions.
  6. Guide with DEG if your predictive model is lightweight or non-differentiable. Otherwise, try TAG.
  7. Sweep temperature parameters and target property thresholds to find a set of parameters that has the highest probability of success under your oracle.
  8. Guide with the best set of parameters!
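
For the Cholesky parameterization mentioned in step 4, the idea is that the network outputs unconstrained numbers, which are mapped to a lower-triangular factor with a positive diagonal so that the resulting covariance is always valid. A minimal NumPy sketch of that mapping follows; the softplus on the diagonal and the function name `covariance_from_raw` are assumptions made for illustration, not the lab's implementation.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def covariance_from_raw(raw, dim):
    """Map dim*(dim+1)/2 unconstrained outputs to a valid covariance matrix."""
    raw = np.asarray(raw, dtype=float)
    expected = dim * (dim + 1) // 2
    if raw.shape != (expected,):
        raise ValueError(f"expected {expected} values, got shape {raw.shape}")
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = raw
    # Force a strictly positive diagonal so that L @ L.T is positive definite.
    diag = np.diag_indices(dim)
    L[diag] = softplus(L[diag]) + 1e-6
    return L @ L.T


# Hypothetical raw network outputs for a 2-dimensional prediction.
cov = covariance_from_raw([0.0, 1.0, -2.0], dim=2)
```

A full covariance needs dim*(dim+1)/2 outputs per prediction instead of dim, which is one reason this option is only worth trying when you have enough data.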

Glossary

Oracle

An oracle predictor is a predictive model that is trained on all available labeled data (including what would normally be held out for evaluation) and that typically cannot make predictions for partially masked sequences. It is our in silico gold standard for modeling the relevant sequence-function relationship. We use it for in silico ranking and calibration checks, not for reporting generalization.

To implement an oracle using the ProteinGen package, see the Training the oracle section of the Predictor Training Guide.

DEG

DEG is the exact classifier-guidance method for any-order discrete generative models described in our paper. Algorithmically, it specifies one site to unmask next, uses the predictive model to evaluate $p(y\mid\tilde{x})$ for each candidate mutant $\tilde{x}$, and combines these with pretrained generative-model predictions via Bayes' rule. For formal details and notation, see the ProteinGuide paper.
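
The Bayes' rule combination at a single site can be sketched in a few lines. This is a toy illustration with a three-token alphabet and made-up numbers, not the ProteinGen implementation: `prior_probs` stands in for the generative model's distribution over tokens at the chosen site, and `predictor_probs` stands in for $p(y\mid\tilde{x})$ evaluated once per candidate.

```python
import numpy as np


def guided_site_distribution(prior_probs, predictor_probs):
    """Combine p(x_i = a | x_t) with p(y | x_t with x_i = a) via Bayes' rule."""
    prior = np.asarray(prior_probs, dtype=float)
    likelihood = np.asarray(predictor_probs, dtype=float)
    if prior.shape != likelihood.shape:
        raise ValueError("need one predictor probability per candidate token")
    unnormalized = prior * likelihood
    total = unnormalized.sum()
    if total <= 0:
        raise ValueError("all candidates have zero posterior mass")
    return unnormalized / total


# Made-up numbers for a three-token alphabet.
prior_probs = [0.5, 0.3, 0.2]      # generative model's view of the site
predictor_probs = [0.1, 0.6, 0.3]  # p(y | candidate) for each token
posterior = guided_site_distribution(prior_probs, predictor_probs)
rng = np.random.default_rng(0)
next_token = rng.choice(len(posterior), p=posterior)
```

Each unmasking step costs one predictor evaluation per candidate token, which is why this approach suits lightweight predictive models.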

To use DEG for conditional generation using the ProteinGen package, see the Step 4: Combine with TAG or DEG section of the ProteinGuide Workflow.

TAG

TAG is an approximate classifier-guidance method that uses the predictive model's gradient on a one-hot input to approximate the Bayes-rule likelihood ratio. Compared to DEG, TAG often trades exactness for lower latency, but can be less stable because it is approximate. For formal details and notation, see the ProteinGuide paper.
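
The gradient approximation can be illustrated with a toy predictor whose gradient is available in closed form. The sketch below assumes a logistic model on a one-hot input, with all names and weights made up for illustration. It compares a first-order Taylor estimate of $\log p(y \mid \tilde{x})$ for every candidate token at one site against the exact values, and it is not the ProteinGen implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def toy_log_prob(x_onehot, W, b):
    """log p(y=1 | x) for a toy logistic predictor on one-hot input."""
    return float(np.log(sigmoid(np.sum(W * x_onehot) + b)))


def toy_grad_log_prob(x_onehot, W, b):
    """Analytic gradient of log p(y=1 | x) with respect to the one-hot input."""
    p = sigmoid(np.sum(W * x_onehot) + b)
    return (1.0 - p) * W


def taylor_site_scores(x_onehot, site, W, b):
    """First-order estimate of log p(y | x with `site` set to each token)."""
    grad = toy_grad_log_prob(x_onehot, W, b)
    current = int(np.argmax(x_onehot[site]))
    base = toy_log_prob(x_onehot, W, b)
    # Swapping token `current` for token a changes x by e_a - e_current.
    return base + grad[site] - grad[site, current]


def exact_site_scores(x_onehot, site, W, b):
    """Exact log p(y | x with `site` set to each token): one call per candidate."""
    scores = []
    for a in range(x_onehot.shape[1]):
        x_new = x_onehot.copy()
        x_new[site] = 0.0
        x_new[site, a] = 1.0
        scores.append(toy_log_prob(x_new, W, b))
    return np.array(scores)


# Toy setup: length-5 sequence over a 4-token alphabet, small random weights.
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(5, 4))
x = np.eye(4)[[0, 2, 1, 3, 0]]
approx = taylor_site_scores(x, site=2, W=W, b=0.1)
exact = exact_site_scores(x, site=2, W=W, b=0.1)
```

The Taylor version needs one gradient evaluation per step instead of one forward pass per candidate, which is where the latency saving comes from. The approximation degrades as the predictor becomes less linear in its input.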

To use TAG for conditional generation using the ProteinGen package, see the Step 4: Combine with TAG or DEG section of the ProteinGuide Workflow.