Data Splits

Split assay-labeled data into train, validation, and test sets for predictive model training and evaluation.

Coming soon

This module is under development. The strategies below describe the intended functionality.

Why splitting matters

Random splits can be misleading for protein fitness prediction because variants that differ by only a few mutations tend to have correlated fitness. A model that memorizes local neighborhoods will score well on a random split but fail on distant variants, which are exactly the ones you want to design.

Split strategies

By mutational distance

Group variants by their Hamming distance from wildtype. Train on singles/doubles, evaluate on higher-order mutants. This tests whether the predictor can extrapolate to more distant parts of sequence space.
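
As a rough illustration of the idea, the sketch below assigns variants to train or test by their Hamming distance from wildtype. Since this module is still under development, the names `hamming_distance`, `split_by_distance`, and `max_train_distance`, the `(sequence, activity)` tuple layout, and the toy sequences are all assumptions made for this example, not the module's API. It also assumes equal-length sequences (substitutions only, no indels).

```python
def hamming_distance(seq, wildtype):
    """Count the positions at which seq differs from wildtype."""
    if len(seq) != len(wildtype):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq, wildtype))


def split_by_distance(variants, wildtype, max_train_distance=2):
    """Train on variants within max_train_distance of wildtype, test on the rest.

    variants: list of (sequence, activity) pairs (hypothetical layout).
    """
    train, test = [], []
    for seq, activity in variants:
        if hamming_distance(seq, wildtype) <= max_train_distance:
            train.append((seq, activity))
        else:
            test.append((seq, activity))
    return train, test


# Toy data, invented for illustration only.
wildtype = "MKTAY"
variants = [
    ("MKTAY", 1.00),  # wildtype
    ("MRTAY", 1.20),  # single mutant
    ("MRTGY", 0.80),  # double mutant
    ("ARTGY", 0.30),  # triple mutant
    ("ARSGW", 0.10),  # differs at every position
]
train, test = split_by_distance(variants, wildtype)
```

With the default threshold of 2, singles and doubles (plus wildtype) land in the training set and higher-order mutants in the test set. Raising `max_train_distance` makes the extrapolation task easier, so it may be worth reporting results at more than one threshold.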

By activity range

Split so that the test set covers the high-activity tail — the regime you care about for design. If the predictor's ranking in this tail is poor, guidance will be unreliable.
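
One possible way to build such a split is to sort variants by measured activity and hold out the top fraction. This is a minimal sketch, not the module's implementation: `split_by_activity`, the `test_fraction` parameter, the `(sequence, activity)` tuple layout, and the toy data are assumptions for illustration.

```python
def split_by_activity(variants, test_fraction=0.2):
    """Hold out the highest-activity variants as the test set.

    variants: list of (sequence, activity) pairs (hypothetical layout).
    test_fraction: share of variants, taken from the top of the activity
    ranking, that goes to the test set.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(variants) < 2:
        raise ValueError("need at least two variants to split")
    ranked = sorted(variants, key=lambda v: v[1])  # ascending activity
    # At least one test variant, and at least one left for training.
    n_test = min(len(ranked) - 1, max(1, int(len(ranked) * test_fraction)))
    return ranked[:-n_test], ranked[-n_test:]


# Toy data, invented for illustration only.
variants = [
    ("MKTAY", 1.00),
    ("MRTAY", 1.20),
    ("MRTGY", 0.80),
    ("ARTGY", 0.30),
    ("ARSGW", 0.10),
]
train, test = split_by_activity(variants, test_fraction=0.4)
```

Note that a split like this leaves the training set without any examples from the high-activity regime, which makes it a deliberately hard test of ranking in the tail.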

By position

Hold out all variants at certain positions. Tests whether the predictor can generalize to unseen positions — important when your library design targets positions not in the training data.
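
The sketch below shows one way a position-based hold-out could work: any variant with a mutation at a held-out position goes to the test set, so the training set never sees a substitution there. The function names, the 0-based position indexing, the `(sequence, activity)` tuple layout, and the toy data are assumptions for this example rather than the module's API.

```python
def mutated_positions(seq, wildtype):
    """Return the set of 0-based positions where seq differs from wildtype."""
    if len(seq) != len(wildtype):
        raise ValueError("sequences must have equal length")
    return {i for i, (a, b) in enumerate(zip(seq, wildtype)) if a != b}


def split_by_position(variants, wildtype, held_out_positions):
    """Send every variant mutated at a held-out position to the test set.

    variants: list of (sequence, activity) pairs (hypothetical layout).
    held_out_positions: iterable of 0-based positions (an assumed convention).
    """
    held_out = set(held_out_positions)
    train, test = [], []
    for seq, activity in variants:
        if mutated_positions(seq, wildtype) & held_out:
            test.append((seq, activity))
        else:
            train.append((seq, activity))
    return train, test


# Toy data, invented for illustration only.
wildtype = "MKTAY"
variants = [
    ("MRTAY", 1.20),  # mutated at position 1
    ("MKSAY", 0.90),  # mutated at position 2
    ("MKTGY", 0.70),  # mutated at position 3
    ("MRTGY", 0.80),  # mutated at positions 1 and 3
    ("MKTAW", 0.40),  # mutated at position 4
]
train, test = split_by_position(variants, wildtype, held_out_positions=[3])
```

A multi-mutant that touches both held-out and training positions is placed in the test set here. That is a design choice: it keeps the training set free of any information about the held-out positions, at the cost of a smaller training set.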

Sensitivity analysis

Train predictors on each split and compare their rankings on the held-out set. If rankings are unstable across splits, the predictor is unreliable for guidance — consider collecting more data or using a simpler model.
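
The text above does not name a stability metric. As one option, the sketch below compares predictors pairwise using Spearman rank correlation of their scores on a shared held-out set. The choice of Spearman, the assumption of a common held-out set, the function names, and the example scores are all illustrative assumptions, not part of the module.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ranking_stability(predictions):
    """Pairwise Spearman correlation between predictors.

    predictions: dict mapping a split name to that predictor's scores on the
    same held-out variants, in the same order (hypothetical layout).
    """
    return {
        (a, b): spearman(predictions[a], predictions[b])
        for a, b in combinations(predictions, 2)
    }


# Invented scores for five held-out variants, for illustration only.
predictions = {
    "distance": [0.10, 0.40, 0.35, 0.80, 0.90],
    "activity": [0.20, 0.50, 0.30, 0.70, 1.00],
    "position": [0.90, 0.10, 0.50, 0.30, 0.20],
}
stability = ranking_stability(predictions)
```

Correlations near 1 for every pair suggest the ranking is robust to how the data was split. Low or negative values for some pairs indicate the instability described above.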