GRIMM: Genetic Stratification for Out-of-Distribution Evaluation

Abstract

Protein function models are routinely evaluated on random train/test splits. In sequence space a random split is a near-duplicate split: homologs land on both sides, so the test set measures interpolation among relatives rather than generalization to new biology. GRIMM replaces random partitioning with homology-aware stratification. Sequences are clustered at a controlled identity threshold, whole clusters are assigned to a single fold, and performance is reported as a curve over the identity gap between each test sequence and its nearest training neighbor. The protocol turns a single optimistic scalar into a generalization profile, and it makes low sequence diversity (the dominant failure mode for enzyme-function prediction) visible and quantifiable.

keywords: out-of-distribution · homology leakage · clustered cross-validation · enzyme function · benchmarking

## The leakage problem

Let D be a labeled set of sequences and s(a,b) the pairwise identity between two of them. A random split draws the test set T uniformly at random from D and trains on the remainder, Train = D \ T. Because protein datasets are dense with homologs, almost every test sequence has at least one training sequence with high s:

E_{a∈T} [ max_{b∈Train} s(a,b) ] ≫ baseline identity in D    (1)

The reported metric is then dominated by sequences the model has effectively already seen. A model can score well by memorizing family-level motifs and never demonstrate that it has learned anything that transfers to a novel fold or an under-sampled clade. The headline number is real; it just answers the wrong question.
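The gap in (1) can be seen on a toy example. The sketch below is illustrative only: `identity` is a hypothetical position-wise stand-in for an alignment-based score, the sequences are made up, and the "baseline" is taken to be the mean pairwise identity over the whole set, which is one possible reading of the right-hand side of (1).

```python
import random
from statistics import mean

def identity(a: str, b: str) -> float:
    """Toy identity (assumption): fraction of matching positions, no alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def nearest_train_identity(test, train):
    """For each test sequence, the highest identity to any training sequence."""
    return [max(identity(a, b) for b in train) for a in test]

# Two made-up "families": close variants of two unrelated seeds.
family_a = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "MKSAYIAKQR"]
family_b = ["GSHWLPEDVN", "GSHWLPEDVD", "GTHWLPEDVN", "GSHWLPQDVN"]
dataset = family_a + family_b

# Random 6/2 split, ignoring family membership.
rng = random.Random(0)
shuffled = dataset[:]
rng.shuffle(shuffled)
train, test = shuffled[:6], shuffled[6:]

# Left-hand side of (1): mean identity of test sequences to their nearest train neighbor.
leak = mean(nearest_train_identity(test, train))
# One reading of the right-hand side: mean identity over all pairs in the dataset.
baseline = mean(identity(a, b) for i, a in enumerate(dataset) for b in dataset[i + 1:])

print(f"mean nearest-train identity: {leak:.2f}")
print(f"mean pairwise identity:      {baseline:.2f}")
```

Whatever the shuffle, every test sequence here keeps at least two family members in the training set, so its nearest-train identity stays far above the dataset-wide average.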

## Stratification protocol

GRIMM enforces a separation constraint between folds. Cluster D at identity threshold τ so that any two sequences whose identity exceeds τ share a cluster (a single-linkage closure, so membership is transitive), then partition whole clusters, never individual sequences, across folds:

s(a,b) > τ ⇒ cluster(a) = cluster(b) ⇒ same fold. (2)

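Provided the clustering is a true single-linkage closure (connected components of the s > τ graph), this bounds the maximum test-to-train identity by τ and removes the leakage in (1) by construction. Greedy representative-based clustering only bounds each member's identity to its cluster representative, so two sequences in different clusters can still exceed τ; in that case the bound has to be checked rather than assumed. Sweeping τ sweeps difficulty: a high threshold tests near-family generalization, a low threshold tests transfer across distant sequence space. The pipeline is deterministic and reproducible: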

```python
# GRIMM: homology-aware folds from a sequence set
# (pseudocode: mmseqs_cluster, assign_clusters, max_cross_fold_identity,
#  eval_clustered and report are placeholder helpers, not library calls)
for tau in [0.9, 0.7, 0.5, 0.3]:                       # sweep the difficulty axis
    clusters = mmseqs_cluster(seqs, min_seq_id=tau)    # must be single-linkage at identity tau
    folds = assign_clusters(clusters, k=5,             # whole clusters -> folds
                            balance="label")           # keep label strata even

    # guarantee: no test sequence above tau identity to any train sequence
    assert max_cross_fold_identity(folds) <= tau

    report(tau, eval_clustered(model, folds))          # one point per threshold
```

  random split                  GRIMM split
  -------------                 -----------
  [ A1 A2 | A3 ]   leakage      [ A1 A2 A3 ] -> train
  [ B1    | B2 ]   leakage      [ B1 B2    ] -> test
   homologs straddle             whole clusters
   the boundary                  stay on one side

Fig. 1 A random split scatters members of a family (A, B) across the boundary; GRIMM keeps each cluster intact in a single fold.
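The constraint in (2) can be sketched end to end on a toy set. This is a minimal illustration under stated assumptions, not the GRIMM implementation: `identity` is a hypothetical position-wise stand-in for an alignment score, the sequences are made up, clustering is a brute-force all-pairs union-find (quadratic, so only suitable for small sets), and `assign_clusters` here balances folds by size only rather than by label.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy identity (assumption): fraction of matching positions, no alignment."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def single_linkage_clusters(seqs, tau):
    """Connected components of the graph with an edge wherever identity > tau."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > tau:
            parent[find(i)] = find(j)      # merge the two components

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def assign_clusters(clusters, k):
    """Greedy size balancing: largest cluster first, into the currently smallest fold."""
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds

def max_cross_fold_identity(seqs, folds):
    """Highest identity between any two sequences placed in different folds."""
    best = 0.0
    for f, g in combinations(folds, 2):
        for i in f:
            for j in g:
                best = max(best, identity(seqs[i], seqs[j]))
    return best

# Three made-up families of close variants.
seqs = [
    "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "MKSAYIAKQR",
    "GSHWLPEDVN", "GSHWLPEDVD", "GTHWLPEDVN", "GSHWLPQDVN",
    "PLVNRCFQTE", "PLVNRCFQTW",
]
tau = 0.7
clusters = single_linkage_clusters(seqs, tau)
folds = assign_clusters(clusters, k=2)

# Holds by construction because the clustering is a full single-linkage closure.
assert max_cross_fold_identity(seqs, folds) <= tau
print(f"{len(clusters)} clusters -> fold sizes {[len(f) for f in folds]}")
```

Because whole components are moved together, the assertion cannot fail with this clustering; with an approximate or representative-based clusterer the same check becomes a real test of the split.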

## Reporting: a curve, not a scalar

Instead of one accuracy, GRIMM reports performance as a function of the identity gap. For a test sequence a, the gap is δ(a) = 1 − max_{b∈Train} s(a,b), one minus its identity to the nearest training sequence. The generalization profile R(δ) exposes exactly where a model stops working:

| Nearest-train identity (1 − δ) | What it probes | Typical behavior |
|---|---|---|
| > 0.9 | near-duplicate recall | optimistic, what random splits report |
| 0.5 – 0.7 | within-family transfer | graceful degradation if features generalize |
| < 0.3 | cross-clade generalization | collapse toward prior under low diversity |

The shape of R(δ) separates two models that look identical on a random split: one whose accuracy holds as δ grows is learning transferable biochemistry; one whose accuracy falls off a cliff at low identity is reading homology. A single random-split number hides the difference.
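One simple way to estimate R(δ) is to bin test sequences by identity gap and report accuracy per bin. The sketch below assumes per-sequence gaps and correctness flags are already available; the function name, the bin edges, and the example values are all hypothetical and chosen only to illustrate the computation.

```python
from collections import defaultdict

def generalization_profile(gaps, correct, edges=(0.0, 0.1, 0.3, 0.5, 0.7, 1.0)):
    """Accuracy per identity-gap bin; gaps[i] = 1 - nearest-train identity of test item i."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gap, ok in zip(gaps, correct):
        for lo, hi in zip(edges, edges[1:]):
            # Half-open bins [lo, hi), with the last bin closed at its upper edge.
            if lo <= gap < hi or (hi == edges[-1] and gap == hi):
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(ok)
                break
    # Bins with no test sequences are omitted rather than reported as zero.
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Made-up per-sequence results, for illustration only.
gaps    = [0.05, 0.08, 0.35, 0.40, 0.45, 0.75, 0.80, 0.90]
correct = [True, True, True, True, False, False, True, False]

profile = generalization_profile(gaps, correct)
for (lo, hi), acc in profile.items():
    print(f"delta in [{lo:.1f}, {hi:.1f}): accuracy {acc:.2f}")
```

Reporting the per-bin counts alongside the accuracies is worth considering, since sparsely populated bins at large δ can make the tail of the curve noisy.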

## Diversity as the binding constraint

For enzyme-function prediction the limiting resource is sequence diversity, not label count. A dataset can be large and still cover a thin slice of sequence space. GRIMM makes that legible: the number of clusters that remain at each τ, and how quickly that number collapses as τ is lowered, is a direct read on effective diversity, and the decay of R(δ) bounds the radius in sequence space over which the model is trustworthy.

The effective sample count for any downstream variance argument is the cluster count, not the sequence count: the same correction that enters the aspect ratio in the companion phase-transition note.
Stratified folds keep label distributions matched so a difficulty sweep does not confound identity with class imbalance.
Clustering is single-pass and cached; the marginal cost of adding a τ to the sweep is one evaluation, not one retrain.
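The cluster-count reading of diversity can be illustrated by counting single-linkage clusters across the τ sweep. As before, this is a sketch under assumptions: `identity` is a hypothetical position-wise score, the sequences are made up, and the merge step is a deliberately simple relabeling loop rather than an efficient clusterer.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy identity (assumption): fraction of matching positions, no alignment."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster_count(seqs, tau):
    """Number of single-linkage clusters when pairs with identity > tau are merged."""
    labels = list(range(len(seqs)))
    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > tau and labels[i] != labels[j]:
            old, new = labels[i], labels[j]
            labels = [new if lab == old else lab for lab in labels]  # merge by relabeling
    return len(set(labels))

# Ten made-up sequences drawn from only three families.
seqs = [
    "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "MKSAYIAKQR",
    "GSHWLPEDVN", "GSHWLPEDVD", "GTHWLPEDVN", "GSHWLPQDVN",
    "PLVNRCFQTE", "PLVNRCFQTW",
]

counts = {tau: cluster_count(seqs, tau) for tau in [0.9, 0.7, 0.5, 0.3]}
for tau, n in counts.items():
    print(f"tau={tau}: {n} clusters from {len(seqs)} sequences")
```

In this toy set the count drops from the number of sequences to the number of families as soon as τ falls below the within-family identity, which is the sense in which the cluster count, not the sequence count, is the effective sample size.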

Co-authored methods note. Clustering via MMseqs2-style identity thresholds.
Liam Kozma · liam.kozma@protonmail.com