Training a Probe + Guided Sampling (TrpB)

Architecture Breakdown

Data: TrpB fitness landscape from SaProtHub (HuggingFace dataset) with continuous fitness labels → data

Models:

ESMC — pretrained generative model (no fine-tuning) → generative_modeling
LinearProbe on cached ESMC embeddings — trained as the predictive model → predictive_modeling. Uses precompute_embeddings for fast training → Training Predictors
DEG(ESMC, probe) — combines them via enumeration-based guidance → guide

Sampling: sample (discrete-time ancestral) — DEG automatically passes position info → sampling

Evaluation: Guided vs. unguided fitness comparison → evaluation
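The reason for caching embeddings is that the frozen encoder runs once per sequence, after which the probe trains on fixed vectors. The sketch below illustrates that idea with the standard library only. It does not use the library's LinearProbe or precompute_embeddings: `toy_embed`, `cache_embeddings`, and `train_linear_probe` are hypothetical names, the "encoder" is a plain amino-acid composition vector standing in for ESMC, and the labels are made-up toy values rather than TrpB measurements.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_embed(seq):
    # Hypothetical stand-in for a frozen encoder such as ESMC:
    # here just the amino-acid composition of the sequence.
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def cache_embeddings(seqs):
    # Embed each sequence exactly once; training never calls the encoder again.
    return {s: toy_embed(s) for s in seqs}

def predict(w, b, x):
    # Linear probe: dot product with the cached embedding plus a bias.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, cache, labels):
    return sum((predict(w, b, cache[s]) - y) ** 2 for s, y in labels.items()) / len(labels)

def train_linear_probe(cache, labels, lr=0.2, epochs=300):
    # Full-batch gradient descent on mean squared error.
    dim = len(AMINO_ACIDS)
    w, b = [0.0] * dim, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for s, y in labels.items():
            x = cache[s]
            err = predict(w, b, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Made-up toy data (not TrpB measurements): label = fraction of W residues.
labels = {"AAWW": 0.5, "ACDE": 0.0, "WWWW": 1.0, "AAAA": 0.0, "WACD": 0.25, "KLMW": 0.25}
cache = cache_embeddings(labels)
loss_before = mse([0.0] * len(AMINO_ACIDS), 0.0, cache, labels)
w, b = train_linear_probe(cache, labels)
loss_after = mse(w, b, cache, labels)
```

Because the probe only ever sees the cached vectors, each training epoch costs a few dot products per sequence, regardless of how expensive the encoder is.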

End-to-end example: train a linear probe on the TrpB fitness landscape using cached ESMC embeddings, then use it for guided sampling with enumeration-based guidance (DEG).

Quick Start

uv run python examples/trpb_linear_probe.py --device cuda

What This Demonstrates

Loading a HuggingFace dataset (SaProtHub/Dataset-TrpB_fitness_landsacpe)
Training a LinearProbe on cached embeddings from ESMC
Combining the probe with ESMC via DEG for guided generation
Evaluating guided vs. unguided samples
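The last step, comparing guided and unguided samples, comes down to summarizing two lists of fitness scores. A minimal sketch follows, assuming the scores have already been obtained for each sample set; `summarize` and `compare` are hypothetical helpers, and the numbers are placeholders for illustration, not results from the example script.

```python
from statistics import mean, stdev

def summarize(scores):
    # Basic descriptive statistics for one set of sampled sequences.
    return {"n": len(scores), "mean": mean(scores), "std": stdev(scores), "best": max(scores)}

def compare(guided, unguided):
    # Report both summaries plus the shift in mean fitness due to guidance.
    g, u = summarize(guided), summarize(unguided)
    return {"guided": g, "unguided": u, "mean_shift": g["mean"] - u["mean"]}

# Placeholder numbers for illustration only, not results from the example script.
guided_scores = [0.6, 0.7, 0.8]
unguided_scores = [0.3, 0.4, 0.5]
report = compare(guided_scores, unguided_scores)
```

One caveat: if the scores come from the same probe that steered the sampling, the comparison is likely to look optimistic. Where ground-truth labels exist for the sampled variants, scoring against those gives a fairer picture.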

How Guidance Works

Discrete Expert-based Guidance (DEG) uses the probe as an "expert" to bias the generative model's sampling toward sequences predicted to have high fitness. At each unmasking step, DEG enumerates the candidate tokens, scores each one with the probe, and reweights the model's token probabilities by those scores before a token is sampled.
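The reweighting idea can be sketched for a single masked position. This is an illustration under stated assumptions, not the library's DEG implementation: the weight `exp(strength * score)` is one common choice of tilt and is assumed here, and `enumerate_candidates`, `guided_distribution`, the `?` mask symbol, and the toy scoring function are all hypothetical.

```python
import math

def enumerate_candidates(seq, position, vocab, score_fn):
    # Enumeration: fill the masked position with every token and score each result.
    return {tok: score_fn(seq[:position] + tok + seq[position + 1:]) for tok in vocab}

def guided_distribution(base_probs, candidate_scores, strength=1.0):
    # Assumed tilt: p_guided(tok) is proportional to p_base(tok) * exp(strength * score(tok)).
    # Subtracting the max score keeps the exponentials numerically stable.
    top = max(candidate_scores.values())
    weights = {
        tok: p * math.exp(strength * (candidate_scores[tok] - top))
        for tok, p in base_probs.items()
    }
    total = sum(weights.values())
    return {tok: wt / total for tok, wt in weights.items()}

def toy_score(seq):
    # Toy stand-in for the probe: counts W residues.
    return float(seq.count("W"))

# One unmasking step on a toy sequence with a two-token vocabulary.
base = {"A": 0.5, "W": 0.5}           # stand-in for the generative model's output
scores = enumerate_candidates("A?C", 1, "AW", toy_score)
guided = guided_distribution(base, scores, strength=1.0)
unchanged = guided_distribution(base, scores, strength=0.0)
```

With `strength=0.0` the base distribution is recovered unchanged, and larger values push more probability mass onto the tokens the probe prefers. The enumeration costs one probe evaluation per candidate token per position, which is part of why a cheap probe on cached embeddings pairs well with this style of guidance.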

See the ProteinGuide workflow for the full guided sampling framework.

Source: examples/trpb_linear_probe.py