Skip to content

Continued Pretraining

Specialize a pretrained generative model to a protein family using homologous sequences, optionally with predicted structures for inverse folding.

Overview

Acquire homologs → Build dataset → Fine-tune with LoRA → Evaluate with likelihood curves

This workflow is for when you want a better base model for your protein family before doing any property-guided generation. It's a common first step in ProteinGuide and useful on its own for unconditional generation of family-like sequences.


Step 1: Acquire homologs

Obtain an MSA of sequences homologous to your protein of interest.

→ See MSA Acquisition

Step 2: Build the dataset

Strip gaps, filter by length, and optionally fold with AF3 for structure conditioning.

→ See MSA → Dataset

Step 3: Fine-tune with LoRA

LoRA fine-tune ESM3 or ESMC on the homolog sequences.

→ See Fine-tuning Generative Models

Key decisions:

  • Sequence-only vs. inverse folding — use sequence-only if you don't have structures or want faster training. Use inverse folding if you have AF3-predicted structures and want the model to learn the structure→sequence mapping for your family.
  • LoRA rankr=4 is a good default for specialization. Higher ranks give more capacity but risk overfitting on small MSAs.
  • Train to convergence — for continued pretraining, you want the model to fully learn the family distribution. Watch the likelihood curves plateau.

Step 4: Evaluate

Compare pretrained vs. fine-tuned models using likelihood curves on held-out homologs.

→ See Likelihood Curves

What to look for:

  • Fine-tuned model should have higher log-probs on held-out homologs at all noise levels
  • If using structure conditioning, the gap between struct-conditioned and seq-only should widen after fine-tuning
  • Sequence-only log-probs should stay roughly flat (no memorization)