Fine-tuning ESM3 on EphB1 (Sequence-only MLM)
Architecture Breakdown
Data: ~10k EphB1 kinase domain homologs from a UniRef MSA, loaded as a ProteinDataset → data. Uses uniform_mask_noise + uniform_time for training-time masking. MSA was obtained via jackhmmer → MSA Acquisition, then cleaned → MSA → Dataset.
Models: ESM3 with LoRA adapters (r=4) → generative_modeling. This is an instance of the Fine-tuning module.
Sampling: None (training only).
Evaluation: Training loss/perplexity tracked per epoch. Likelihood curves would be the natural next step → Likelihood Curves.
Fine-tune ESM3 with LoRA on ~10k EphB1 kinase domain homologs using masked language modeling.
Acknowledgement
We thank Kosuke Seki for providing the EphB1 MSA dataset used in this example.
Citation: Seki, K. et al. (2025). bioRxiv preprint. https://doi.org/10.1101/2025.08.03.668353
Quick Start
What It Does
This example trains ESM3 to predict randomly masked amino acids from the surrounding sequence context. The training data is an MSA of EphB1 kinase domain homologs (~10k sequences, each 200–295 residues long).
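The masking step can be illustrated with a minimal sketch. It assumes one plausible reading of uniform_time and uniform_mask_noise: a masking rate t is drawn uniformly per sequence, and each position is then masked independently with probability t. The function name mask_sequence, the "<mask>" placeholder, and the example sequence are hypothetical and are not taken from the library.

```python
import random

MASK = "<mask>"  # placeholder symbol; the real tokenizer's mask token may differ


def mask_sequence(seq, rng):
    """Mask residues of `seq` at a rate t drawn uniformly from [0, 1)."""
    t = rng.random()  # assumed "uniform time": one masking rate per sequence
    tokens, targets = [], []
    for aa in seq:
        if rng.random() < t:      # assumed "uniform noise": independent per position
            tokens.append(MASK)
            targets.append(aa)    # the model is trained to recover this residue
        else:
            tokens.append(aa)
            targets.append(None)  # unmasked positions carry no target in this sketch
    return t, tokens, targets


rng = random.Random(0)  # fixed seed so the sketch is reproducible
SEQ = "GSFGEVCRGRLK"    # arbitrary short example sequence
t, tokens, targets = mask_sequence(SEQ, rng)
print(f"masking rate t = {t:.2f}")
print(" ".join(tokens))
```

Because t varies from sequence to sequence, the model sees everything from lightly masked to almost fully masked inputs during training, rather than a single fixed masking rate.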
Results after 5 epochs: loss 1.80 → 1.60, perplexity 6.04 → 4.96.
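The two reported metrics are related: if the loss is a mean cross-entropy in nats (an assumption, since the page does not state the units), perplexity is simply its exponential. A short check under that assumption:

```python
import math


def perplexity(mean_nll):
    """Perplexity for a mean negative log-likelihood given in nats."""
    return math.exp(mean_nll)


for loss in (1.80, 1.60):
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
```

The values this gives agree with the reported 6.04 and 4.96 to within the rounding of the two-decimal loss figures. A perplexity near 5 means the model is, on average, about as uncertain as a uniform choice among five amino acids at each masked position.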
Key Details
- Uses LoRA adapters (not full fine-tuning) to keep memory usage manageable
- Requires the --amp flag (bfloat16); ESM3 fp32 logits overflow without it
- Training uses model(input_ids, **observations) directly, not get_log_probs
See the Fine-tuning workflow for details on the training loop and LoRA setup.
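To see why LoRA keeps memory manageable, it helps to count trainable parameters for a single weight matrix. The sketch below uses the standard LoRA formulation, in which a frozen d_out × d_in weight is augmented by a trainable product of a d_out × r and an r × d_in matrix. The width of 1536 is an illustrative value, not a figure taken from this page, and lora_param_counts is a hypothetical helper.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out        # every entry of the dense weight is trainable
    lora = r * (d_in + d_out)  # only the two low-rank factors are trainable
    return full, lora


D_MODEL = 1536  # illustrative layer width (assumption)
full, lora = lora_param_counts(D_MODEL, D_MODEL, r=4)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=4):       {lora:,} trainable parameters")
print(f"reduction:        {full // lora}x")
```

Gradients and optimizer state are only kept for trainable parameters, so the saving applies to those buffers as well, while the frozen base weights still have to fit in memory.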