Evaluation
Tools for sanity-checking your pipeline at each stage: data quality, model fidelity, generation diversity, and structural plausibility.
Likelihood Curves
The primary evaluation tool for generative models. Measures how well a model predicts masked amino acids as context is progressively revealed.
from proteingen.eval import compute_log_prob_trajectory, plot_log_prob_trajectories
# sequences: list[str] of protein sequences; model: a GenerativeModel
traj = compute_log_prob_trajectory(sequences, model, n_time_points=20)
plot_log_prob_trajectories([traj], ["ESMC-300M"], "likelihood.png")
For fixed-order teacher-forced decoding diagnostics (log p of the true token at the position currently being decoded):
from proteingen.eval import (
    compute_decoding_log_prob_trajectory,
    plot_decoding_log_prob_trajectories,
)
# orders: one LongTensor of token positions per sequence, giving the decode order
traj = compute_decoding_log_prob_trajectory(sequences, model, orders)
plot_decoding_log_prob_trajectories([traj], ["ESMC-300M"], "decode_likelihood.png")
For compute_log_prob_trajectory, each noise level $t \in [0, 1)$ unmasks every position independently with probability $t$ and records the model's average $\log p(x_\text{true})$ over the positions that remain masked. Plotting these averages against $t$ gives a curve that runs from "no context" (left) to nearly "full context" (right).
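The masking procedure can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `masked_avg_log_prob` and `uniform_model` are hypothetical names, and the stand-in model assigns uniform probability over the 20 standard amino acids so that the block runs without proteingen.

```python
import math
import random


def masked_avg_log_prob(sequence, predict_log_probs, t, rng):
    """Average log p(true residue) over positions left masked at noise level t."""
    # Each position is revealed with probability t, so it stays masked with 1 - t.
    masked = {i for i in range(len(sequence)) if rng.random() >= t}
    if not masked:
        return float("nan")  # nothing left to score at this noise level
    context = ["<mask>" if i in masked else aa for i, aa in enumerate(sequence)]
    log_probs = predict_log_probs(context)  # one {residue: log p} dict per position
    return sum(log_probs[i][sequence[i]] for i in masked) / len(masked)


def uniform_model(context):
    """Hypothetical stand-in model: uniform over the 20 standard amino acids."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return [{aa: -math.log(len(alphabet)) for aa in alphabet} for _ in context]


rng = random.Random(0)
n_time_points = 5
time_points = [k / n_time_points for k in range(n_time_points)]  # t in [0, 1)
curve = [masked_avg_log_prob("MKTAYIAKQR", uniform_model, t, rng) for t in time_points]
```

With a uniform model the curve is flat at log(1/20); a real model should rise as t grows, because more context is visible at each masked position.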
What to look for:
- Higher is better: a fine-tuned model should have higher log-probs than the pretrained baseline on in-distribution sequences
- Structure conditioning boost: structure-conditioned models should show uniformly higher curves than sequence-only models
- Overfitting: if the fine-tuned model's curve on held-out sequences drops below the pretrained model's curve, the fine-tuned model has overfit
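The comparisons in this checklist can be made numerically from the `avg_log_probs` arrays of two trajectories. The sketch below uses nested lists and made-up numbers; `nan_mean`, `mean_curve` and `curve_gap` are hypothetical helpers, and the "every gap is negative" rule is only one possible way to flag the overfitting case.

```python
import math


def nan_mean(values):
    """Mean that skips NaN entries (sequences with no masked positions)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def mean_curve(avg_log_probs):
    """Collapse (n_sequences, n_time_points) rows into one mean per time point."""
    return [nan_mean(column) for column in zip(*avg_log_probs)]


def curve_gap(finetuned, pretrained):
    """Per-time-point difference: fine-tuned mean minus pretrained mean."""
    return [f - p for f, p in zip(mean_curve(finetuned), mean_curve(pretrained))]


# Made-up held-out numbers, for illustration only.
pretrained = [[-3.0, -2.5, -2.0], [-3.2, -2.7, float("nan")]]
finetuned = [[-3.4, -2.9, -2.6], [-3.6, -3.1, float("nan")]]

gap = curve_gap(finetuned, pretrained)
# Illustrative rule: flag overfitting when the fine-tuned curve is lower everywhere.
overfit_suspected = all(g < 0 for g in gap if not math.isnan(g))
```

A positive gap on in-distribution sequences together with a negative gap on held-out sequences is the pattern the overfitting bullet describes.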
See the likelihood curves workflow for detailed usage and interpretation.
Oracle Scoring
Score generated sequences with a separately-trained oracle model to estimate how well guidance worked. The oracle should be trained on all available data (including later rounds if available) and is never used during sampling.
# Score generated library with a separately-trained oracle
oracle_preds = oracle.predict(generated_sequences)
Coming soon
Convenience functions for oracle scoring, threshold analysis, and round-over-round improvement tracking.
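Threshold analysis and round-over-round tracking can be approximated by hand in the meantime. The sketch below assumes oracle predictions are plain floats and uses made-up scores; `fraction_above`, the round names and the cutoff are hypothetical, not part of proteingen.

```python
def fraction_above(scores, threshold):
    """Fraction of oracle scores strictly above the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to summarize")
    return sum(s > threshold for s in scores) / len(scores)


# Made-up oracle predictions per design round, for illustration only.
oracle_preds_by_round = {
    "round_1": [0.2, 0.5, 0.9, 0.4],
    "round_2": [0.6, 0.8, 0.9, 0.3],
}
threshold = 0.5  # hypothetical activity cutoff

hit_rates = {
    name: fraction_above(preds, threshold)
    for name, preds in oracle_preds_by_round.items()
}
```

A hit rate that rises from one round to the next suggests guidance is moving the library toward the target property, subject to the oracle's own error.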
Predictor–Oracle Agreement
Before trusting a noisy predictor during guided sampling, check that it agrees with the oracle on clean (fully unmasked) sequences. Low agreement means the predictor is unreliable during generation.
from scipy.stats import spearmanr
oracle_scores = oracle.predict(val_sequences)
predictor_scores = noisy_predictor.predict(val_sequences)
rho, _ = spearmanr(oracle_scores, predictor_scores)
print(f"Agreement: ρ = {rho:.3f}") # want ρ > 0.7 ideally
Coming soon
predictor_oracle_agreement(oracle, predictor, sequences) — returns correlation metrics and generates scatter plots.
Diversity Metrics
Assess whether generated libraries have sufficient sequence diversity to be useful for experimental screening.
Key metrics:
- Sequence identity to wildtype — are variants too similar to the starting point?
- Pairwise sequence identity — are generated sequences diverse from each other?
- Mutational distance distribution — how many mutations from wildtype?
- Positional entropy — is diversity spread across positions or concentrated?
Coming soon
sequence_diversity(sequences, wildtype) — computes all diversity metrics in one call.
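Until that helper exists, the four metrics can be computed directly. This is a minimal sketch assuming all sequences have the same length as the wildtype (substitutions only, no alignment); the function names and the toy library are illustrative and not part of proteingen.

```python
import math
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions (mutational distance without indels)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no alignment is done)")
    return sum(x != y for x, y in zip(a, b))


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return (len(a) - hamming(a, b)) / len(a)


def positional_entropy(sequences):
    """Shannon entropy (in bits) of the residue distribution at each position."""
    entropies = []
    for column in zip(*sequences):
        total = len(column)
        counts = Counter(column).values()
        entropies.append(-sum(c / total * math.log2(c / total) for c in counts))
    return entropies


wildtype = "MKTAY"
library = ["MKTAY", "MRTAY", "MRTGY", "MKSAY"]  # toy variants of the wildtype

identity_to_wt = [identity(s, wildtype) for s in library]
mutation_counts = [hamming(s, wildtype) for s in library]
pairwise = [identity(a, b) for a, b in combinations(library, 2)]
mean_pairwise_identity = sum(pairwise) / len(pairwise)
entropy_per_position = positional_entropy(library)
```

Positions with zero entropy are identical across the library; a few high-entropy positions with the rest at zero indicate that diversity is concentrated rather than spread out.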
Structural Validation
For critical applications, validate that generated sequences fold into the intended structure.
Approaches:
- AlphaFold 3 — fold generated sequences and compare predicted structures to the target backbone (using pTM, pLDDT, or TM-score)
- ESM3 structure tokens — quick structural plausibility check without full folding
Coming soon
Integration with af3-server for batch structure prediction of generated libraries.
API Reference
proteingen.eval
Evaluation utilities.
DecodingLogProbTrajectory
Bases: TypedDict
Per-sequence teacher-forced decode trajectories.
- list[(n_steps_s,)]: per-sequence fraction of positions already unmasked before each decode step.
- decoded_position_log_probs, list[(n_steps_s,)]: log p(true token) at the position decoded at each step, in the same order.
Source code in src/proteingen/eval/likelihood_curves.py
LogProbTrajectory
Bases: TypedDict
Result of compute_log_prob_trajectory.
- time_points, shape (n_time_points,): fraction of positions unmasked at each step.
- avg_log_probs, shape (n_sequences, n_time_points): per-sequence average log p(x_true) at masked positions. NaN where a sequence had no masked positions.
Source code in src/proteingen/eval/likelihood_curves.py
PropertyTrajectory
Bases: TypedDict
Result of property tracking during generation.
- time_points, shape (n_time_points,): fraction of positions unmasked.
- p_y_gt_t, shape (n_sequences, n_time_points): probability of exceeding the threshold at each step.
Source code in src/proteingen/eval/property_curves.py
compute_decoding_log_prob_trajectory
compute_decoding_log_prob_trajectory(sequences: list[str], model: GenerativeModel, orders: list[LongTensor], batch_size: int = 32) -> DecodingLogProbTrajectory
Teacher-forced log p(true token) along fixed decoding orders.
Each sequence starts with all order positions masked. At decode step k, the model predicts the true residue at order[k] given the partially unmasked context from earlier steps (order[:k]). This returns one trajectory per sequence at native step resolution.
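The decode loop described above can be sketched with a toy model. This is an illustration only: `decode_trajectory` and `uniform_model` are hypothetical, positions are plain string indices rather than tokenizer positions, the step fraction is assumed to be measured relative to `len(order)`, and each step is scored in its own call instead of being batched.

```python
import math


def decode_trajectory(sequence, order, predict_log_probs, mask="<mask>"):
    """Teacher-forced log p(true residue) at order[k], given order[:k] revealed."""
    to_decode = set(order)
    # Positions listed in `order` start masked; all other positions stay visible.
    context = [mask if i in to_decode else aa for i, aa in enumerate(sequence)]
    fractions, log_probs = [], []
    for k, pos in enumerate(order):
        # Assumption: the fraction is measured relative to len(order).
        fractions.append(k / len(order))
        step_log_probs = predict_log_probs(context)
        log_probs.append(step_log_probs[pos][sequence[pos]])
        context[pos] = sequence[pos]  # teacher forcing: reveal the true residue
    return fractions, log_probs


seen_mask_counts = []  # records how many masks the model saw at each step


def uniform_model(context):
    """Hypothetical stand-in model: uniform over the 20 standard amino acids."""
    seen_mask_counts.append(context.count("<mask>"))
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return [{aa: -math.log(len(alphabet)) for aa in alphabet} for _ in context]


fractions, log_probs = decode_trajectory("MKTAY", [2, 0, 4], uniform_model)
```

Because the true residue is written back after every step, the model never conditions on its own predictions, which is what distinguishes this diagnostic from sampling.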
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sequences` | `list[str]` | protein sequences to evaluate. | required |
| `model` | `GenerativeModel` | a GenerativeModel with a tokenizer mask token. | required |
| `orders` | `list[LongTensor]` | one token-position order per sequence (same tokenizer space as model). | required |
| `batch_size` | `int` | number of decode steps to score per forward pass. | `32` |
Returns:
| Type | Description |
|---|---|
| `DecodingLogProbTrajectory` | DecodingLogProbTrajectory with per-sequence step fractions and log probs. |
Source code in src/proteingen/eval/likelihood_curves.py
compute_log_prob_trajectory
compute_log_prob_trajectory(sequences: list[str], model: GenerativeModel, n_time_points: int, batch_size: int = 32) -> LogProbTrajectory
Compute average log-probability trajectories under progressive unmasking.
For each of n_time_points evenly-spaced noise levels t in [0, 1), randomly masks each sequence position with probability (1 - t), then measures the model's average log p(true token) at the masked positions.
At t ≈ 0: nearly everything is masked (little context → low log prob). At t ≈ 1: nearly everything is revealed (rich context → high log prob).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sequences` | `list[str]` | protein sequences to evaluate. | required |
| `model` | `GenerativeModel` | a GenerativeModel (e.g. ESMC wrapped with MaskedModelLogitFormatter). | required |
| `n_time_points` | `int` | number of evenly-spaced noise levels to evaluate. | required |
| `batch_size` | `int` | sequences per forward pass. | `32` |
Returns:
| Type | Description |
|---|---|
| `LogProbTrajectory` | LogProbTrajectory with time_points and per-sequence avg log probs. |
Source code in src/proteingen/eval/likelihood_curves.py
plot_decoding_log_prob_trajectories
plot_decoding_log_prob_trajectories(trajectories: list[DecodingLogProbTrajectory], labels: list[str], output_path: str | Path, show_individual: bool = False, max_individual: int = 20, n_grid_points: int = 100, title: str = 'Teacher-forced decoding log-likelihood') -> None
Plot teacher-forced decode trajectories on a normalized x-axis.
Sequence lengths differ, so each per-sequence curve is linearly interpolated onto a shared [0, 1] grid (fraction unmasked, n_grid_points points) before the curves are aggregated per model.
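The interpolation step can be sketched as follows. This is an assumption-laden illustration rather than the library's code: `interpolate_to_grid` is a hypothetical helper, `xs` must be strictly increasing, and grid points outside the observed range are clamped to the nearest endpoint value (the same convention as `numpy.interp`), which may differ from what the plotting function does.

```python
def interpolate_to_grid(xs, ys, n_grid_points):
    """Linearly interpolate (xs, ys) onto an evenly spaced grid over [0, 1]."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    if n_grid_points < 2:
        raise ValueError("need at least two grid points")
    grid = [i / (n_grid_points - 1) for i in range(n_grid_points)]
    out, j = [], 0
    for g in grid:
        if g <= xs[0]:
            out.append(ys[0])  # clamp before the first observed step
        elif g >= xs[-1]:
            out.append(ys[-1])  # clamp after the last observed step
        else:
            while xs[j + 1] < g:  # advance to the segment containing g
                j += 1
            w = (g - xs[j]) / (xs[j + 1] - xs[j])
            out.append(ys[j] + w * (ys[j + 1] - ys[j]))
    return grid, out


# Two toy trajectories with different numbers of decode steps.
grid, curve_a = interpolate_to_grid([0.0, 0.5], [-3.0, -1.0], 5)
_, curve_b = interpolate_to_grid([0.0, 0.25, 0.5, 0.75], [-4.0, -3.0, -2.0, -1.0], 5)
mean_curve = [(a + b) / 2 for a, b in zip(curve_a, curve_b)]
```

Once every curve lives on the same grid, the mean across sequences is a simple element-wise average.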
Source code in src/proteingen/eval/likelihood_curves.py
plot_log_prob_trajectories
plot_log_prob_trajectories(trajectories: list[LogProbTrajectory], labels: list[str], output_path: str | Path, show_individual: bool = True, max_individual: int = 200, title: str = 'Log-likelihood trajectory under progressive unmasking') -> None
Plot one or more log-probability trajectories on a single figure.
Each trajectory is drawn as a mean ± std band, optionally with individual sequence curves behind it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `trajectories` | `list[LogProbTrajectory]` | list of LogProbTrajectory dicts to plot. | required |
| `labels` | `list[str]` | display name for each trajectory (must match len(trajectories)). | required |
| `output_path` | `str \| Path` | file path for the saved plot. | required |
| `show_individual` | `bool` | if True, draw faint per-sequence lines behind the mean. | `True` |
| `max_individual` | `int` | cap on per-sequence lines drawn per trajectory. | `200` |
| `title` | `str` | plot title. | `'Log-likelihood trajectory under progressive unmasking'` |
Source code in src/proteingen/eval/likelihood_curves.py
compute_property_trajectory_from_sampling
Extract property trajectories from a SamplingTrajectory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `trajectory` | `SamplingTrajectory` | A SamplingTrajectory obtained with record_p_y_gt_t=True. | required |
Returns:
| Type | Description |
|---|---|
| `PropertyTrajectory` | PropertyTrajectory with time_points (>0 to 1) and p_y_gt_t. |