MSA Acquisition¶
Tools and strategies for obtaining multiple sequence alignments of homologous proteins.
Coming soon
This module will provide convenience wrappers or links to established tools. For now, we describe the landscape.
Sequence-based homolog search¶
Standard tools for finding proteins with similar sequences:
| Tool | Description | Link |
|---|---|---|
| MMseqs2 | Fast sequence search — orders of magnitude faster than BLAST | github.com/soedinglab/MMseqs2 |
| jackhmmer | Iterative HMM search against UniRef/UniProt | Part of HMMER |
| ColabFold MSA server | Free web API for fast MSA generation | colabfold.com |
When to use sequence-based: Default choice. Works well when your protein has identifiable homologs in UniRef. Most protein families have thousands of homologs.
Structure-based homolog search¶
For proteins with few sequence homologs (e.g. de novo designs, orphan proteins), search by structural similarity:
| Tool | Description | Link |
|---|---|---|
| Foldseek | Fast structural search against AlphaFold DB or PDB | github.com/steineggerlab/foldseek |
When to use structure-based: When sequence search returns too few hits (< 100), or when you want to include remote homologs that share fold but not sequence. Combine the structural hits with sequence-based hits for a richer MSA.
From MSA to dataset¶
Once you have your MSA (as a FASTA file), proceed to MSA → Dataset to prepare it for training.