Continued Pretraining
Specialize a pretrained generative model to a protein family using homologous sequences, optionally with predicted structures for inverse folding.
Overview
Use this workflow when you want a better base model for your protein family before doing any property-guided generation. It is a common first step in ProteinGuide and is also useful on its own for unconditional generation of family-like sequences.
Step 1: Acquire homologs
Obtain an MSA of sequences homologous to your protein of interest.
→ See MSA Acquisition
Step 2: Build the dataset
Strip gaps, filter by length, and optionally fold with AF3 for structure conditioning.
→ See MSA → Dataset
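The gap-stripping and length-filtering step can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual implementation: the function names, the `min_len`/`max_len` thresholds, the upper-casing of lowercase insert columns, and the exact-duplicate removal are all assumptions made for the example. Structure prediction with AF3 is not shown.

```python
from typing import Iterable, List

# Gap symbols commonly found in alignment formats ("-" and ".").
GAP_CHARS = "-."


def strip_gaps(aligned: str) -> str:
    """Remove gap characters and upper-case the remaining residues."""
    return "".join(c for c in aligned if c not in GAP_CHARS).upper()


def build_dataset(aligned_seqs: Iterable[str], min_len: int, max_len: int) -> List[str]:
    """Ungap each aligned row, keep sequences within [min_len, max_len],
    and drop exact duplicates (assumption: duplicates are not wanted)."""
    seen = set()
    dataset = []
    for aligned in aligned_seqs:
        seq = strip_gaps(aligned)
        if not (min_len <= len(seq) <= max_len):
            continue  # too short (fragment) or too long
        if seq in seen:
            continue  # identical after ungapping
        seen.add(seq)
        dataset.append(seq)
    return dataset


# Toy alignment rows (hypothetical, for illustration only).
msa = ["MK-TAYI--AK", "mk.tayiak", "MKT", "MKQTAYIGAK"]
dataset = build_dataset(msa, min_len=5, max_len=12)
print(dataset)
```

In practice the length bounds would be chosen relative to the protein of interest, for example to discard fragments far shorter than the query.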
Step 3: Fine-tune with LoRA
LoRA fine-tune ESM3 or ESMC on the homolog sequences.
→ See Fine-tuning Generative Models
Key decisions:
- Sequence-only vs. inverse folding: use sequence-only if you don't have structures or want faster training. Use inverse folding if you have AF3-predicted structures and want the model to learn the structure→sequence mapping for your family.
- LoRA rank: r=4 is a good default for specialization. Higher ranks give more capacity but risk overfitting on small MSAs.
- Train to convergence: for continued pretraining, you want the model to fully learn the family distribution. Watch for the likelihood curves to plateau.
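To make the rank trade-off concrete, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. It relies only on the standard LoRA construction, in which a frozen `d_out × d_in` weight is updated by the product of a `d_out × r` matrix and an `r × d_in` matrix. The hidden width `D = 1024` is a hypothetical value chosen for illustration and is not taken from ESM3 or ESMC.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen weight matrix it modifies."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


D = 1024  # hypothetical hidden width, for illustration only
for r in (4, 8, 16, 64):
    n = lora_param_count(D, D, r)
    print(f"r={r:<3d} adapter params={n:>7d}  fraction of full matrix={lora_fraction(D, D, r):.2%}")
```

Capacity grows linearly with the rank, which is why a small MSA with few effective sequences is easier to overfit at higher ranks.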
Step 4: Evaluate
Compare pretrained vs. fine-tuned models using likelihood curves on held-out homologs.
→ See Likelihood Curves
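As a rough sketch of what a likelihood curve measures, the code below masks a fraction of positions in each held-out sequence and averages the log-probability the model assigns to the true residues at those positions. It assumes that "noise level" means the fraction of masked positions. The `score_fn` interface is hypothetical, and the uniform baseline stands in for a real model purely so the example runs. The tooling linked above may compute the curves differently.

```python
import math
import random
from typing import Callable, Dict, Sequence

# Hypothetical interface: given a sequence and the masked positions,
# return the log-probability of the true residue at each masked position.
ScoreFn = Callable[[str, Sequence[int]], Sequence[float]]


def likelihood_curve(
    seqs: Sequence[str],
    score_fn: ScoreFn,
    mask_fractions: Sequence[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Mean per-residue log-prob of masked positions at each mask fraction."""
    rng = random.Random(seed)
    curve = {}
    for frac in mask_fractions:
        total, count = 0.0, 0
        for seq in seqs:
            n_mask = max(1, round(frac * len(seq)))
            positions = sorted(rng.sample(range(len(seq)), n_mask))
            log_probs = score_fn(seq, positions)
            total += sum(log_probs)
            count += len(log_probs)
        curve[frac] = total / count
    return curve


def uniform_score_fn(seq: str, positions: Sequence[int]) -> Sequence[float]:
    """Placeholder model: uniform over the 20 standard amino acids."""
    return [math.log(1.0 / 20.0)] * len(positions)


held_out = ["MKTAYIAK", "MKQTAYIGAK"]  # toy held-out set
curve = likelihood_curve(held_out, uniform_score_fn, [0.25, 0.5, 1.0])
print(curve)
```

Running the same function with the pretrained and the fine-tuned model as `score_fn`, on the same held-out sequences and seed, gives two curves that can be compared level by level.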
What to look for:
- Fine-tuned model should have higher log-probs on held-out homologs at all noise levels
- If using structure conditioning, the gap between struct-conditioned and seq-only should widen after fine-tuning
- Sequence-only log-probs should stay roughly flat (no memorization)
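The first two checks can be written as small comparisons over likelihood curves, here represented as dictionaries mapping noise level to mean log-prob. The helper names are hypothetical, and the numbers are made-up placeholders chosen only to exercise the code. They are not measured results.

```python
from typing import Dict

Curve = Dict[float, float]  # noise level -> mean log-prob on held-out homologs


def higher_at_all_levels(pretrained: Curve, finetuned: Curve) -> bool:
    """True if the fine-tuned curve is above the pretrained one at every noise level."""
    return all(finetuned[level] > pretrained[level] for level in pretrained)


def conditioning_gap(struct_curve: Curve, seq_curve: Curve) -> Curve:
    """Per-level gap between structure-conditioned and sequence-only log-probs."""
    return {level: struct_curve[level] - seq_curve[level] for level in seq_curve}


def gap_widened(before: Curve, after: Curve) -> bool:
    """True if the conditioning gap grew at every noise level."""
    return all(after[level] > before[level] for level in before)


# Placeholder values for illustration only (not real measurements).
pre_struct = {0.25: -1.8, 0.5: -2.2, 1.0: -2.6}
ft_struct = {0.25: -1.0, 0.5: -1.4, 1.0: -1.8}
pre_seq = {0.25: -2.0, 0.5: -2.5, 1.0: -3.0}
ft_seq = {0.25: -2.0, 0.5: -2.4, 1.0: -3.0}

improved = higher_at_all_levels(pre_struct, ft_struct)
widened = gap_widened(
    conditioning_gap(pre_struct, pre_seq),
    conditioning_gap(ft_struct, ft_seq),
)
print(f"improved at all levels: {improved}, conditioning gap widened: {widened}")
```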