
Training a Probe + Guided Sampling (TrpB)

Architecture Breakdown

Data: TrpB fitness landscape from SaProtHub (HuggingFace dataset) with continuous fitness labels → data

Models:

  • ESMC — pretrained generative model (no fine-tuning) → generative_modeling
  • LinearProbe on cached ESMC embeddings — trained as the predictive model → predictive_modeling. Uses precompute_embeddings for fast training → Training Predictors
  • DEG(ESMC, probe) — combines them via enumeration-based guidance → guide
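
To make the "probe on cached embeddings" idea concrete, here is a minimal NumPy sketch. It is not the library's LinearProbe or precompute_embeddings: the embeddings are random stand-ins for cached ESMC outputs, the helper names fit_linear_probe and predict are hypothetical, and closed-form ridge regression is swapped in for whatever optimizer the example script actually uses. The point is only that, once embeddings are cached, fitting a linear map to continuous fitness labels is cheap.

```python
import numpy as np


def fit_linear_probe(embeddings, fitness, l2=1e-2):
    """Closed-form ridge regression: fitness ~ embeddings @ w + b.

    Hypothetical helper for illustration; not part of the library.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(fitness, dtype=np.float64)
    # Center inputs and targets so the bias is not regularized.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + l2 * I) w = Xc^T yc.
    w = np.linalg.solve(Xc.T @ Xc + l2 * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(embeddings, w, b):
    """Predicted fitness for each embedding row."""
    return np.asarray(embeddings, dtype=np.float64) @ w + b


# Synthetic stand-in for cached embeddings (200 sequences, 16 dimensions).
rng = np.random.default_rng(0)
cached = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
fitness = cached @ true_w + 0.01 * rng.normal(size=200)

w, b = fit_linear_probe(cached, fitness)
pred = predict(cached, w, b)
```

Because the embeddings are computed once and reused, the generative model does not need to be run again while the probe is being fit.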

Sampling: sample (discrete-time ancestral) — DEG automatically passes position info → sampling

Evaluation: Guided vs. unguided fitness comparison → evaluation

End-to-end example: train a linear probe on the TrpB fitness landscape using cached ESMC embeddings, then use it for guided sampling with enumeration-based guidance (DEG).

Quick Start

uv run python examples/trpb_linear_probe.py --device cuda

What This Demonstrates

  1. Loading a HuggingFace dataset (SaProtHub/Dataset-TrpB_fitness_landsacpe)
  2. Training a LinearProbe on cached embeddings from ESMC
  3. Combining the probe with ESMC via DEG for guided generation
  4. Evaluating guided vs. unguided samples
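
Step 4 can be pictured as a comparison of fitness scores between the two sample sets. The sketch below is an assumption about what such a comparison might report, not the metrics computed by examples/trpb_linear_probe.py; compare_fitness is a hypothetical helper, and the numbers are placeholder values chosen only to exercise the function.

```python
import numpy as np


def compare_fitness(guided, unguided):
    """Summarize fitness scores of guided vs. unguided samples.

    Hypothetical helper for illustration; not part of the library.
    """
    guided = np.asarray(guided, dtype=float)
    unguided = np.asarray(unguided, dtype=float)
    return {
        "guided_mean": float(guided.mean()),
        "unguided_mean": float(unguided.mean()),
        # Positive shift means guided samples score higher on average.
        "mean_shift": float(guided.mean() - unguided.mean()),
        # Fraction of guided samples that beat the unguided median.
        "frac_above_unguided_median": float(
            (guided > np.median(unguided)).mean()
        ),
    }


# Placeholder scores, not real results.
summary = compare_fitness([0.6, 0.8, 0.7, 0.9], [0.2, 0.5, 0.4, 0.3])
```

If the scores come from the same probe that guided the sampling, a positive shift is expected by construction, so an independent fitness measure is the more informative check where one is available.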

How Guidance Works

Discrete Expert-based Guidance (DEG) uses the probe as an "expert" to bias the generative model's sampling toward sequences predicted to have high fitness. At each unmasking step, DEG enumerates the candidate tokens for the position being unmasked, scores each candidate with the probe, and reweights the model's token probabilities by those predictions.
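
The reweighting at a single position can be sketched as follows. This is a toy illustration under stated assumptions, not the library's DEG implementation: guided_token_probs and guidance_strength are hypothetical names, the three-token vocabulary is made up, and the exponential weighting of probe scores is one common choice that may differ from the weighting DEG actually applies.

```python
import numpy as np


def guided_token_probs(model_probs, candidate_scores, guidance_strength=1.0):
    """Reweight token probabilities at one position by predictor scores.

    Assumed form: p_guided(token) ∝ p_model(token) * exp(strength * score).
    Hypothetical helper for illustration; not part of the library.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    scores = np.asarray(candidate_scores, dtype=float)
    # Work in log space and subtract the max for numerical stability.
    log_w = np.log(model_probs + 1e-12) + guidance_strength * scores
    log_w -= log_w.max()
    probs = np.exp(log_w)
    return probs / probs.sum()


# Toy vocabulary of three candidate tokens at the position being unmasked.
model_probs = np.array([0.5, 0.3, 0.2])   # generative model's proposal
probe_scores = np.array([0.0, 1.0, 2.0])  # probe score for each candidate

guided = guided_token_probs(model_probs, probe_scores, guidance_strength=1.0)
unchanged = guided_token_probs(model_probs, probe_scores, guidance_strength=0.0)
```

With guidance_strength set to 0 the model's own distribution is recovered, and larger values shift more probability mass toward candidates the probe scores highly.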

See the ProteinGuide workflow for the full guided sampling framework.

Source: examples/trpb_linear_probe.py