Training a Probe + Guided Sampling (TrpB)
Architecture Breakdown
Data: TrpB fitness landscape from SaProtHub (HuggingFace dataset) with continuous fitness labels → data.
Models:
- ESMC: pretrained generative model (no fine-tuning) → generative_modeling
- LinearProbe on cached ESMC embeddings: trained as the predictive model → predictive_modeling. Uses precompute_embeddings for fast training → Training Predictors
- DEG(ESMC, probe): combines them via enumeration-based guidance → guide
Sampling: sample (discrete-time ancestral); DEG automatically passes position info → sampling
Evaluation: Guided vs. unguided fitness comparison → evaluation
End-to-end example: train a linear probe on the TrpB fitness landscape using cached ESMC embeddings, then use it for guided sampling with enumeration-based guidance (DEG).
Quick Start
What This Demonstrates
- Loading a HuggingFace dataset (SaProtHub/Dataset-TrpB_fitness_landsacpe)
- Training a LinearProbe on cached embeddings from ESMC
- Combining the probe with ESMC via DEG for guided generation
- Evaluating guided vs. unguided samples
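The idea behind training a probe on cached embeddings can be illustrated with a minimal NumPy sketch. It is not the library's LinearProbe or precompute_embeddings: `toy_embed`, `cache_embeddings`, and `fit_linear_probe` are hypothetical names, the "encoder" is a plain amino-acid composition vector standing in for ESMC, and the fitness labels are synthetic. The point is only that the frozen encoder runs once, after which fitting the probe is a cheap linear regression on a fixed matrix.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_embed(seq):
    """Hypothetical stand-in for a frozen encoder: 20-dim amino-acid composition."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / len(seq)

def cache_embeddings(sequences, embed_fn):
    """Run the (frozen) encoder once per sequence and stack the results."""
    return np.stack([embed_fn(s) for s in sequences])

def fit_linear_probe(X, y, l2=1e-3):
    """Closed-form ridge regression on cached embeddings; returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + l2 * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (an assumption for illustration, not the TrpB dataset).
rng = np.random.default_rng(0)
sequences = ["".join(rng.choice(list(AMINO_ACIDS), size=30)) for _ in range(200)]
X = cache_embeddings(sequences, toy_embed)           # computed once, reused below
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + 0.01 * rng.normal(size=len(X))      # continuous "fitness" labels

# Simple train / held-out split.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

w, b = fit_linear_probe(X_train, y_train)
pred = X_test @ w + b
corr = np.corrcoef(pred, y_test)[0, 1]
print(f"held-out Pearson r = {corr:.3f}")
```

In the actual example the embeddings come from ESMC and the probe is trained with the library's own training utilities; the closed-form ridge fit here is a swapped-in simplification.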
How Guidance Works
Discrete Expert-based Guidance (DEG) uses the probe as an "expert" to bias the generative model's sampling toward sequences predicted to have high fitness. At each unmasking step, the model's token probabilities are reweighted by the probe's predictions.
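The reweighting described above can be sketched on a toy problem. Everything in this block is an assumption made for illustration: the four-letter alphabet, the uniform `toy_model_probs` standing in for ESMC, the `toy_probe` that counts 'A' residues, and the specific rule of multiplying the model probability by exp(gamma × predicted fitness). The library's DEG may weight candidates differently; this only shows the enumerate, score, and renormalize pattern at each unmasking step.

```python
import numpy as np

VOCAB = "ACDE"   # toy alphabet (hypothetical)
MASK = "_"

def toy_model_probs(seq, pos):
    """Stand-in for the generative model: uniform over VOCAB at a masked position."""
    return np.full(len(VOCAB), 1.0 / len(VOCAB))

def toy_probe(seq):
    """Stand-in for the trained probe: predicted fitness = number of 'A' residues."""
    return float(seq.count("A"))

def guided_probs(seq, pos, model_probs_fn, probe_fn, gamma=1.0):
    """Enumerate every candidate token at `pos`, score it, and reweight."""
    base = model_probs_fn(seq, pos)
    scores = np.array(
        [probe_fn(seq[:pos] + tok + seq[pos + 1:]) for tok in VOCAB]
    )
    logits = np.log(base) + gamma * scores   # assumes base has no zero entries
    logits -= logits.max()                   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def sample(length, rng, gamma):
    """Ancestral unmasking: fill one masked position per step, in random order."""
    seq = MASK * length
    for pos in rng.permutation(length):
        pos = int(pos)
        probs = guided_probs(seq, pos, toy_model_probs, toy_probe, gamma)
        tok = VOCAB[rng.choice(len(VOCAB), p=probs)]
        seq = seq[:pos] + tok + seq[pos + 1:]
    return seq

rng = np.random.default_rng(0)
unguided = [sample(10, rng, gamma=0.0) for _ in range(200)]   # gamma=0 disables guidance
guided = [sample(10, rng, gamma=2.0) for _ in range(200)]
unguided_mean = np.mean([toy_probe(s) for s in unguided])
guided_mean = np.mean([toy_probe(s) for s in guided])
print(f"mean predicted fitness: unguided={unguided_mean:.2f}, guided={guided_mean:.2f}")
```

Setting gamma to 0 recovers the unguided model, and larger values push samples harder toward what the probe scores highly. Enumeration costs one probe call per candidate token per step, which is why a cheap probe is a natural fit for this style of guidance.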
See the ProteinGuide workflow for the full guided sampling framework.
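Once guided and unguided samples have been scored, the comparison can be summarized with a few statistics. The helper below is a hypothetical sketch, not the library's evaluation module, and the scores passed in are made-up numbers used only to show the call.

```python
import numpy as np

def compare_fitness(guided, unguided):
    """Summarize how guided fitness scores differ from unguided ones."""
    guided = np.asarray(guided, dtype=float)
    unguided = np.asarray(unguided, dtype=float)
    # Use the unguided 90th percentile as a reference bar for "high fitness".
    threshold = np.quantile(unguided, 0.9)
    return {
        "mean_guided": float(guided.mean()),
        "mean_unguided": float(unguided.mean()),
        "mean_shift": float(guided.mean() - unguided.mean()),
        "frac_guided_above_unguided_p90": float((guided > threshold).mean()),
    }

# Illustrative scores only (not results from the TrpB example).
report = compare_fitness(
    guided=[0.8, 1.1, 0.9, 1.4],
    unguided=[0.2, 0.5, 0.1, 0.6],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

If the scores come from the same probe that steered the sampling, the improvement is likely to look optimistic, so an independent predictor or held-out labels give a fairer comparison where available.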
Source: examples/trpb_linear_probe.py