DPLM-2
ByteDance's discrete diffusion protein language model (DPLM-2, ICLR'25). Uses masked diffusion over an extended vocabulary that includes both amino acid and structure codebook tokens. Currently supports sequence-only mode.
- Output dim: 8229 (33 AA tokens + 8196 structure tokens, formatted by `MaskedModelLogitFormatter` to expose only the 20 standard AAs + mask)
- Embedding dim: 640 (150m), 1280 (650m), 2560 (3b), set dynamically from model weights
- LoRA support: yes, via `apply_lora()`
- Structure conditioning: not yet supported (joint sequence+structure generation requires upstream's structure VQ-VAE tokenizer)
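
To make the "expose only the 20 standard AAs + mask" idea concrete, here is a minimal, framework-free sketch of restricting a logit vector to an allowed set of token indices and renormalizing. It is not the implementation of `MaskedModelLogitFormatter`; the function name `restrict_log_probs`, the toy vocabulary size, and the allowed indices are all hypothetical and chosen only for illustration.

```python
import math


def restrict_log_probs(logits, allowed):
    """Log-softmax over `allowed` indices only; every other index gets -inf."""
    allowed = sorted(set(allowed))
    # Subtract the max over allowed logits for numerical stability.
    peak = max(logits[i] for i in allowed)
    log_norm = peak + math.log(sum(math.exp(logits[i] - peak) for i in allowed))
    out = [float("-inf")] * len(logits)
    for i in allowed:
        out[i] = logits[i] - log_norm
    return out


# Toy 8-token vocabulary; indices 2, 3 and 5 stand in for the exposed
# tokens (hypothetical ids, not DPLM-2's real vocabulary layout).
toy_logits = [0.5, -1.0, 2.0, 0.0, 3.0, 1.0, -2.0, 0.3]
log_probs = restrict_log_probs(toy_logits, allowed=[2, 3, 5])
```

The hidden tokens end up with zero probability, so the exposed tokens form a proper distribution even though the underlying output layer is much wider.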
Available checkpoints
HuggingFace hub (airkingbd/):
| Checkpoint | Params | Hidden | Layers |
|---|---|---|---|
| `airkingbd/dplm2_150m` | 150M | 640 | 30 |
| `airkingbd/dplm2_650m` | 650M | 1280 | 33 |
| `airkingbd/dplm2_3b` | 3B | 2560 | 36 |
```python
from proteingen.models import DPLM2

model = DPLM2("airkingbd/dplm2_650m").cuda()  # default checkpoint
log_probs = model.get_log_probs_from_string(["ACDEFGHIK"])
```
DPLM-2 works with the same sampling, guidance, and probe infrastructure as the ESM models:
```python
from proteingen.sampling import sample_ctmc_linear_interpolation

# Start from 100 mask tokens and let the sampler fill them in.
init_tokens = model.tokenizer(["<mask>" * 100], return_tensors="pt")["input_ids"].cuda()
sequences = sample_ctmc_linear_interpolation(model, init_tokens, n_steps=100)
```
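
For intuition about what sampling from a fully masked sequence involves, the following is a generic sketch of iterative unmasking under a linear schedule. It is not the implementation of `sample_ctmc_linear_interpolation`, whose internals are not described here. The names `toy_denoiser` and `unmask_linear` are hypothetical, and the "model" is a stand-in that proposes uniformly random residues.

```python
import random

MASK = "<mask>"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_denoiser(tokens, rng):
    """Stand-in for a model: proposes a uniformly random residue per position."""
    return [rng.choice(AMINO_ACIDS) for _ in tokens]


def unmask_linear(length, n_steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(n_steps):
        proposals = toy_denoiser(tokens, rng)
        # Linear schedule: a position still masked at step k is revealed
        # with probability 1 / (n_steps - k), which reaches 1 on the last step.
        p_reveal = 1.0 / (n_steps - step)
        for i, tok in enumerate(tokens):
            if tok == MASK and rng.random() < p_reveal:
                tokens[i] = proposals[i]
    return "".join(tokens)


sequence = unmask_linear(length=100, n_steps=100)
```

With this reveal probability, the expected fraction of masked positions falls linearly from 1 to 0 over the run, and the final step reveals anything still masked.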
Untied embedding weights
The HuggingFace config for DPLM-2 incorrectly sets `tie_word_embeddings=True`. The `DPLM2` wrapper overrides this to `False` before loading. If you load the model manually via `AutoModelForMaskedLM` without the same override, you'll get wrong logits.
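
If you do need to load a checkpoint by hand, a sketch along the following lines applies the same override through the standard `transformers` config mechanism. The helper names `untie_word_embeddings` and `load_dplm2_untied` are hypothetical, and it is an assumption that this flag is the only thing the `DPLM2` wrapper changes, so compare the resulting logits against the wrapper before relying on it.

```python
def untie_word_embeddings(config):
    """Force untied input/output embeddings on a HuggingFace-style config."""
    config.tie_word_embeddings = False
    return config


def load_dplm2_untied(checkpoint="airkingbd/dplm2_650m"):
    # Imported lazily so the helper above stays usable without transformers installed.
    from transformers import AutoConfig, AutoModelForMaskedLM

    # Patch the config before the weights are loaded, as the wrapper is described as doing.
    config = untie_word_embeddings(AutoConfig.from_pretrained(checkpoint))
    return AutoModelForMaskedLM.from_pretrained(checkpoint, config=config)
```

Calling `load_dplm2_untied()` downloads the checkpoint from the hub, so it needs network access and enough memory for the chosen model size.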