Methods · Benchmarking
GRIMM: Genetic Stratification for Inference in Molecular Modeling
## Abstract
Protein function models are routinely reported on random train/test splits. In sequence space a random split is a near-duplicate split: homologs land on both sides, so the test set measures interpolation among relatives rather than generalization to new biology. GRIMM replaces random partitioning with homology-aware stratification. Sequences are clustered at controlled identity, whole clusters are assigned to a single fold, and performance is reported as a curve over the identity gap between test sequences and their nearest training neighbor. The protocol turns a single optimistic scalar into a generalization profile, and it makes low sequence diversity (the dominant failure mode for enzyme-function prediction) visible and quantifiable.
keywords: out-of-distribution · homology leakage · clustered cross-validation · enzyme function · benchmarking
## The leakage problem
Let D be a labeled set of sequences and s(a, b) the pairwise identity between two of them. A random split draws the test set T uniformly from D. Because protein datasets are dense with homologs, almost every test sequence has a training sequence with high s:
$$\max_{x \in D \setminus T} s(t, x) \approx 1 \quad \text{for almost every } t \in T \tag{1}$$
The reported metric is then dominated by sequences the model has effectively already seen. A model can score well by memorizing family-level motifs and never demonstrate that it has learned anything that transfers to a novel fold or an under-sampled clade. The headline number is real; it just answers the wrong question.
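The leakage can be checked directly by measuring, for each test sequence, its highest identity to any training sequence. The sketch below is a toy illustration, not the GRIMM pipeline: `identity` is a stand-in that compares equal-length, pre-aligned strings position by position, and the two four-member families are invented for the example.

```python
import random

def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in two equal-length strings."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def nearest_train_identity(test, train):
    """For each test sequence, the highest identity to any training sequence."""
    return [max(identity(t, x) for x in train) for t in test]

# Two invented families; members of a family differ at one or two positions,
# and the families share no positions with each other.
family_a = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQE", "MKTAYLAKQR"]
family_b = ["GSHWLPEDNV", "GSHWLPEDNI", "GSHWLPEDNL", "GSHFLPEDNV"]
seqs = family_a + family_b

# Random 6/2 split: sequences are shuffled without regard to family.
rng = random.Random(0)
shuffled = seqs[:]
rng.shuffle(shuffled)
train, test = shuffled[:6], shuffled[6:]

leak = nearest_train_identity(test, train)
print(leak)
```

Whichever two sequences land in the test set, each still has at least two relatives in training, so every value in `leak` is 0.8 or higher. That is the situation described by (1), reproduced on a scale small enough to inspect by hand.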
## Stratification protocol
GRIMM enforces a separation constraint between folds. Cluster D at identity threshold τ so that any two sequences with identity above τ share a cluster, then partition whole clusters, never individual sequences, across folds:
$$s(a, b) > \tau \;\Rightarrow\; \mathrm{fold}(a) = \mathrm{fold}(b) \tag{2}$$
This guarantees the maximum test-to-train identity is bounded by τ, removing the leakage in (1) by construction. Sweeping τ sweeps difficulty: a high threshold tests near-family generalization, a low threshold tests transfer across distant sequence space. The pipeline is deterministic and reproducible:
```python
# GRIMM: homology-aware folds from a sequence set.
# Helper names are schematic pipeline steps, not library calls.
for tau in [0.9, 0.7, 0.5, 0.3]:                     # sweep the difficulty axis
    clusters = mmseqs_cluster(seqs, min_seq_id=tau)  # single-linkage at identity tau
    folds = assign_clusters(clusters, k=5,           # whole clusters -> folds
                            balance="label")         # keep label strata even
    # guarantee: no test sequence above tau identity to any train sequence
    assert max_cross_fold_identity(folds) <= tau
    report(eval_clustered(model, folds))
```
| | Random split | GRIMM split |
|---|---|---|
| Cluster A | [ A1 A2 \| A3 ] leakage | [ A1 A2 A3 ] -> train |
| Cluster B | [ B1 \| B2 ] leakage | [ B1 B2 ] -> test |
| Effect | homologs straddle the boundary | whole clusters stay on one side |
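The protocol can be sketched end to end on a toy set. Everything here is an illustrative stand-in rather than the GRIMM implementation: `identity` compares pre-aligned equal-length strings, `single_linkage_clusters` does an exhaustive all-pairs union-find in place of a real clustering tool, and `assign_whole_clusters` balances folds by size only (the label balancing described above is omitted).

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def single_linkage_clusters(seqs, tau):
    """Link every pair with identity > tau; clusters are the connected components."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > tau:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def assign_whole_clusters(clusters, k):
    """Greedy size balancing: largest cluster first, into the currently smallest fold."""
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds

def max_cross_fold_identity(seqs, folds):
    """Highest identity between any two sequences that sit in different folds."""
    best = 0.0
    for f, g in combinations(folds, 2):
        for i in f:
            for j in g:
                best = max(best, identity(seqs[i], seqs[j]))
    return best

# Two invented families with no shared positions between them.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQE", "MKTAYLAKQR",
        "GSHWLPEDNV", "GSHWLPEDNI", "GSHWLPEDNL", "GSHFLPEDNV"]
tau = 0.5
clusters = single_linkage_clusters(seqs, tau)
folds = assign_whole_clusters(clusters, k=2)

# Any pair above tau was merged into one cluster, so it cannot straddle folds.
assert max_cross_fold_identity(seqs, folds) <= tau
```

The bound holds here because every pair is compared and linking is transitive. A heuristic or greedy clustering that skips comparisons would not give the same guarantee, so the final assertion is worth keeping as a runtime check in any real pipeline.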
## Reporting: a curve, not a scalar
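Instead of a single accuracy, GRIMM reports performance as a function of the identity gap δ(t) = 1 − max_x s(t, x), where x ranges over the training sequences. The generalization profile R(δ) exposes where a model stops working. The table is indexed by nearest-train identity, which is 1 − δ: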
| Nearest-train identity | What it probes | Typical behavior |
|---|---|---|
| > 0.9 | near-duplicate recall | optimistic, what random splits report |
| 0.5 – 0.7 | within-family transfer | graceful degradation if features generalize |
| < 0.3 | cross-clade generalization | collapse toward prior under low diversity |
The shape of R(δ) separates two indistinguishable-on-paper models: one whose accuracy holds as δ grows is learning transferable biochemistry; one whose accuracy falls off a cliff at low identity is reading homology. A single random-split number hides the difference.
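One way to compute such a profile is to bin test sequences by identity gap and report accuracy per bin. The sketch below assumes the gaps and per-sequence correctness flags have already been computed; the function name, the bin edges, and the numbers are all invented for illustration and are not results.

```python
def generalization_profile(gaps, correct, edges):
    """Accuracy per identity-gap bin.

    Bins are half-open [lo, hi), except the last, which also includes hi.
    Returns a list of (lo, hi, accuracy or None, count) tuples.
    """
    profile = []
    n_bins = len(edges) - 1
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        is_last = b == n_bins - 1
        hits = [c for g, c in zip(gaps, correct)
                if lo <= g < hi or (is_last and g == hi)]
        accuracy = sum(hits) / len(hits) if hits else None
        profile.append((lo, hi, accuracy, len(hits)))
    return profile

# Invented toy inputs: identity gap and correctness for eight test sequences.
gaps = [0.05, 0.08, 0.2, 0.35, 0.45, 0.6, 0.8, 0.9]
correct = [True, True, True, True, False, False, False, False]

# Edges chosen to line up with identity bands >0.9, 0.7-0.9, 0.5-0.7, 0.3-0.5, <0.3.
edges = [0.0, 0.1, 0.3, 0.5, 0.7, 1.0]

profile = generalization_profile(gaps, correct, edges)
for lo, hi, acc, n in profile:
    print(f"delta in [{lo}, {hi}): accuracy={acc}, n={n}")
```

Reporting the count next to each accuracy matters: a sparsely populated bin at large δ is itself a diversity signal, and an accuracy computed from a handful of sequences should be read with that in mind.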
## Diversity as the binding constraint
For enzyme-function prediction the limiting resource is sequence diversity, not label count. A dataset can be large and still cover a thin slice of sequence space. GRIMM makes that legible: the number of clusters that survive as τ is lowered is a direct read on effective diversity, and the decay of R(δ) bounds the radius in sequence space over which the model is trustworthy.
- The effective sample count for any downstream variance argument is the cluster count, not the sequence count: the same correction that enters the aspect ratio in the companion phase-transition note.
- Stratified folds keep label distributions matched so a difficulty sweep does not confound identity with class imbalance.
- Clustering is single-pass and cached; the marginal cost of adding a τ to the sweep is one evaluation, not one retrain.
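The gap between sequence count and cluster count can be made concrete by counting clusters across the τ sweep. This is a minimal sketch under the same toy assumptions as the rest of this note's examples: `identity` is a position-wise comparison of pre-aligned strings, `count_clusters` is a hypothetical helper that does an exhaustive connected-components search, and the sequences are invented.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def count_clusters(seqs, tau):
    """Number of connected components when pairs with identity > tau are linked."""
    n = len(seqs)
    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first walk over the identity graph
            i = stack.pop()
            for j in range(n):
                if j not in seen and identity(seqs[i], seqs[j]) > tau:
                    seen.add(j)
                    stack.append(j)
    return components

# Eight invented sequences drawn from only two families.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQE", "MKTAYLAKQR",
        "GSHWLPEDNV", "GSHWLPEDNI", "GSHWLPEDNL", "GSHFLPEDNV"]

effective_n = {tau: count_clusters(seqs, tau) for tau in [0.9, 0.7, 0.5, 0.3]}
print(effective_n)
```

On this toy set the count drops from eight at τ = 0.9 to two at τ = 0.7 and stays there, so a variance argument at any of the lower thresholds has two effective samples rather than eight.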
Co-authored methods note. Clustering via MMseqs2-style identity thresholds.
Liam Kozma · liam.kozma@protonmail.com