
DPLM-2

DPLM-2 (ICLR'25) is ByteDance's discrete diffusion protein language model. It uses masked diffusion over an extended vocabulary that includes both amino acid tokens and structure codebook tokens. The wrapper currently supports sequence-only mode.

  • Output dim: 8229 (33 AA tokens + 8196 structure tokens); MaskedModelLogitFormatter formats the output so that only the 20 standard AAs + mask are exposed
  • Embedding dim: 640 (150m), 1280 (650m), 2560 (3b) — set dynamically from model weights
  • LoRA support: yes, via apply_lora()
  • Structure conditioning: not yet supported (joint sequence+structure generation requires upstream's structure VQ-VAE tokenizer)
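
To make the output-dimension arithmetic concrete, here is a minimal sketch of one common way to expose only a subset of a larger vocabulary: renormalize over the allowed token ids and assign negative infinity everywhere else. This is an illustration under assumptions, not the code of MaskedModelLogitFormatter; the function name restrict_log_probs and the token ids used below are hypothetical placeholders.

```python
import math

# 33 AA tokens + 8196 structure tokens, as listed above.
VOCAB_SIZE = 33 + 8196  # 8229


def restrict_log_probs(logits, allowed_ids):
    """Return log-probabilities normalized over allowed_ids only.

    Positions outside allowed_ids receive -inf, so they carry zero probability.
    """
    allowed = set(allowed_ids)
    # Log-sum-exp over the allowed entries, shifted by the max for stability.
    top = max(logits[i] for i in allowed)
    log_z = top + math.log(sum(math.exp(logits[i] - top) for i in allowed))
    return [
        logits[i] - log_z if i in allowed else -math.inf
        for i in range(len(logits))
    ]


# Hypothetical ids: 20 standard amino acids plus one mask token.
# Read the real ids from the model's tokenizer rather than hard-coding them.
ALLOWED_IDS = list(range(4, 24)) + [32]

example_logits = [0.0] * VOCAB_SIZE
example_log_probs = restrict_log_probs(example_logits, ALLOWED_IDS)
```

With uniform logits, each of the 21 allowed tokens ends up with probability 1/21 and every other entry with probability zero.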

Available checkpoints

HuggingFace hub (airkingbd/):

Checkpoint            Params  Hidden  Layers
airkingbd/dplm2_150m  150M    640     30
airkingbd/dplm2_650m  650M    1280    33
airkingbd/dplm2_3b    3B      2560    36

from proteingen.models import DPLM2

model = DPLM2("airkingbd/dplm2_650m").cuda()  # default checkpoint
log_probs = model.get_log_probs_from_string(["ACDEFGHIK"])

DPLM-2 works with the same sampling, guidance, and probe infrastructure as the ESM models:

from proteingen.sampling import sample_ctmc_linear_interpolation

init_tokens = model.tokenizer(["<mask>" * 100], return_tensors="pt")["input_ids"].cuda()
sequences = sample_ctmc_linear_interpolation(model, init_tokens, n_steps=100)
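
For intuition about what a masked sampler with a linear schedule does, the toy loop below starts from an all-mask sequence and unmasks positions step by step. It is a self-contained sketch of the general idea, not the implementation of sample_ctmc_linear_interpolation; toy_denoiser and sample_linear_unmasking are hypothetical names, and the uniform "model" stands in for a real network.

```python
import random

MASK = "<mask>"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_denoiser(tokens):
    """Stand-in for a model: uniform distribution over the 20 AAs per position."""
    p = 1.0 / len(AMINO_ACIDS)
    return [{aa: p for aa in AMINO_ACIDS} for _ in tokens]


def sample_linear_unmasking(length, n_steps, denoiser=toy_denoiser, seed=0):
    """Iteratively unmask an all-mask sequence under a linear schedule."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(n_steps):
        # With t = step / n_steps and dt = 1 / n_steps, a still-masked position
        # is unmasked with probability dt / (1 - t) = 1 / (n_steps - step).
        # On the final step this equals 1, so no mask survives.
        p_unmask = 1.0 / (n_steps - step)
        probs = denoiser(tokens)
        for i, tok in enumerate(tokens):
            if tok == MASK and rng.random() < p_unmask:
                choices = list(probs[i].keys())
                weights = list(probs[i].values())
                tokens[i] = rng.choices(choices, weights=weights)[0]
    return "".join(tokens)


toy_sequence = sample_linear_unmasking(length=100, n_steps=100)
```

In a real sampler the denoiser would be the protein language model, and its per-position distribution would depend on the tokens already unmasked.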

Untied embedding weights

The HuggingFace config for DPLM-2 incorrectly sets tie_word_embeddings=True. The DPLM2 wrapper overrides this to False before loading. If you load the model manually via AutoModelForMaskedLM without the same override, the logits will be wrong.
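
If you do need to load a checkpoint by hand, the sketch below shows one way to apply the same override with the transformers Auto classes. It assumes the checkpoint resolves through AutoConfig and AutoModelForMaskedLM as the text above implies, and it is not guaranteed to match what the DPLM2 wrapper does internally; untie_embeddings and load_dplm2_manually are hypothetical helper names.

```python
def untie_embeddings(config):
    """Force tie_word_embeddings to False on a config-like object."""
    config.tie_word_embeddings = False
    return config


def load_dplm2_manually(name="airkingbd/dplm2_650m"):
    """Load a checkpoint with untied embeddings (downloads weights when called)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoConfig, AutoModelForMaskedLM

    config = untie_embeddings(AutoConfig.from_pretrained(name))
    # Passing the patched config keeps the output projection separate from the
    # input embeddings when the weights are loaded.
    return AutoModelForMaskedLM.from_pretrained(name, config=config)
```

Calling load_dplm2_manually() requires network access and the transformers package. Prefer the DPLM2 wrapper unless you have a specific reason to bypass it.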