Autoregressive Generation (ProGen3)¶
Generate novel protein sequences using ProGen3's autoregressive (left-to-right) sampling. Unlike masked models that iteratively unmask positions, ProGen3 generates one amino acid at a time from N-terminal to C-terminal.
Quick Start¶
from proteingen.models import ProGen3
from proteingen import sample
model = ProGen3("Profluent-Bio/progen3-112m").cuda()
# Fixed-length: 5 proteins of 100 residues each
init_x = ["<mask>" * 100 for _ in range(5)]
result = sample(model, init_x, in_order="left_to_right")
print(result["sequences"])
Fixed-Length Generation via sample()¶
Use <mask> tokens as placeholders and sample() fills them left-to-right:
from proteingen.models import ProGen3
from proteingen import sample
model = ProGen3("Profluent-Bio/progen3-112m").cuda()
init_x = ["<mask>" * 200 for _ in range(10)]
result = sample(model, init_x, in_order="left_to_right")
for i, seq in enumerate(result["sequences"]):
print(f"Protein {i}: {len(seq)} residues")
print(f" {seq[:80]}...")
This uses the same sample() interface as masked models — the AutoregressiveLogitFormatter ensures only the next unfilled position gets non-trivial logits at each step.
Open-Ended Generation (Variable Length)¶
Let the model decide when to stop:
result = model.generate(
n=10,
max_new_tokens=512,
temperature=0.8,
top_p=0.95,
)
for i, seq in enumerate(result["sequences"]):
print(f"Protein {i}: {len(seq)} residues")
Prompted Generation (Sequence Completion)¶
Provide an N-terminal prefix and let the model complete the protein:
result = model.generate(
prompt="MKTLLLTLVVVTIVCLD",
n=5,
max_new_tokens=512,
)
assert all(seq.startswith("MKTLLLTLVVVTIVCLD") for seq in result["sequences"])
Scoring Sequences¶
Evaluate how likely a sequence is under the model. ProGen3 scores bidirectionally (N→C and C→N averaged) for a more robust estimate:
sequences = [
"MKTLLLTLVVVTIVCLDLGYAAQSEGSSRQLIAAIGAICGAILLNYTFNQEIAQ",
"ACDEFGHIKLMNPQRSTVWY",
]
scores = model.score(sequences)
print("Log-likelihoods:", scores["log_likelihood"])
print("Perplexities:", scores["perplexity"])
How It Works¶
ProGen3 is a causal language model — it factorizes the probability of a protein as:
$$P(\text{seq}) = \prod_{i=1}^{L} P(a_i \mid a_1, \ldots, a_{i-1})$$
Each token is sampled from the model's predicted distribution given all previous tokens. The model uses a sparse mixture-of-experts (MoE) architecture.
When used with sample(), the AutoregressiveLogitFormatter adapts this to the framework's position-aligned convention by shifting logits left by one (so logits[i] = prediction for position i) and blocking all positions except the first unfilled one.