
Methods · Benchmarking

GRIMM: Genetic Stratification for Inference in Molecular Modeling

Abstract

Protein function models are routinely reported on random train/test splits. In sequence space a random split is a near-duplicate split: homologs land on both sides, so the test set measures interpolation among relatives rather than generalization to new biology. GRIMM replaces random partitioning with homology-aware stratification. Sequences are clustered at controlled identity, whole clusters are assigned to a single fold, and performance is reported as a curve over the identity gap between test sequences and their nearest training neighbor. The protocol turns a single optimistic scalar into a generalization profile, and it makes low sequence diversity (the dominant failure mode for enzyme-function prediction) visible and quantifiable.

keywords: out-of-distribution · homology leakage · clustered cross-validation · enzyme function · benchmarking

## The leakage problem

Let D be a labeled set of sequences and s(a,b) the pairwise identity between two of them. A random split draws test set T uniformly from D. Because protein datasets are dense with homologs, for almost every test sequence there exists a training sequence with high s:

$$\mathbb{E}_{a \in T}\Big[\max_{b \in \mathrm{Train}} s(a,b)\Big] \;\gg\; \text{baseline identity in } D \tag{1}$$

The reported metric is then dominated by sequences the model has effectively already seen. A model can score well by memorizing family-level motifs and never demonstrate that it has learned anything that transfers to a novel fold or an under-sampled clade. The headline number is real; it just answers the wrong question.
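
As a rough illustration of (1), the sketch below measures the left-hand side on a toy dataset and compares it with the mean pairwise identity in D. Everything here is an assumption made for the example: `identity` is a position-wise match fraction over equal-length strings rather than an alignment-based identity, and the two families are invented.

```python
import random
from itertools import combinations
from statistics import mean

def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (assumes equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_nearest_train_identity(train, test):
    """Left-hand side of (1): mean over test of the max identity to train."""
    return mean(max(identity(a, b) for b in train) for a in test)

def mean_pairwise_identity(seqs):
    """Baseline: mean identity over all unordered pairs in D."""
    return mean(identity(a, b) for a, b in combinations(seqs, 2))

# Two invented families, each a seed plus single-residue variants.
family_a = ["MKVLAAGG", "MKVLAAGA", "MKVLASGG", "MKILAAGG"]
family_b = ["TTPQWERY", "TTPQWERF", "TSPQWERY", "TTPQWDRY"]
data = family_a + family_b

# Random 6/2 split, as a conventional benchmark would draw it.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
train, test = shuffled[:6], shuffled[6:]

leak = mean_nearest_train_identity(train, test)
base = mean_pairwise_identity(data)
print(f"nearest-train identity: {leak:.2f}, baseline: {base:.2f}")
```

With four members per family and only two held out, every test sequence keeps at least two relatives in the training set, so the nearest-train identity sits far above the baseline whichever way the shuffle falls.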

## Stratification protocol

GRIMM enforces a separation constraint between folds. Cluster D at identity threshold τ so that any two sequences whose identity exceeds τ share a cluster, then partition whole clusters, never individual sequences, across folds:

$$s(a,b) > \tau \;\Rightarrow\; \mathrm{cluster}(a) = \mathrm{cluster}(b) \;\Rightarrow\; \text{same fold} \tag{2}$$

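This guarantees the maximum test-to-train identity is bounded by τ, removing the leakage in (1) by construction. The guarantee holds only when the clustering satisfies (2) for every pair, which means single-linkage (connected-component) clustering; a greedy representative-based clustering bounds identity to the cluster representative only and can leave pairs above τ in different clusters. Sweeping τ sweeps difficulty: a high threshold tests near-family generalization, a low threshold tests transfer across distant sequence space. The pipeline is deterministic and reproducible: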

```python
# GRIMM: homology-aware folds from a sequence set
def grimm_folds(seqs, tau, k=5):
    clusters = mmseqs_cluster(seqs, min_seq_id=tau)  # single-linkage at identity tau
    folds = assign_clusters(clusters, k=k,           # whole clusters -> folds
                            balance="label")         # keep label strata even
    # guarantee: no cross-fold pair of sequences exceeds identity tau
    assert max_cross_fold_identity(folds) <= tau
    return folds

for tau in [0.9, 0.7, 0.5, 0.3]:                     # sweep the difficulty axis
    report(eval_clustered(model, grimm_folds(seqs, tau)))  # re-cluster per tau
```

```text
  random split                  GRIMM split
  -------------                 -----------
  [ A1 A2 | A3 ]   leakage      [ A1 A2 A3 ] -> train
  [ B1    | B2 ]   leakage      [ B1 B2    ] -> test
   homologs straddle             whole clusters
   the boundary                  stay on one side
```

Fig. 1. A random split scatters members of a family (A, B) across the boundary; GRIMM keeps each cluster intact in a single fold.
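
The protocol can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the note's pipeline: `single_linkage_clusters`, `assign_folds`, and the toy `identity` are hypothetical helpers written for this example, the quadratic all-pairs comparison stands in for a real clustering tool, and label balancing is omitted.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (assumes equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def single_linkage_clusters(seqs, tau):
    """Connected components of the graph with an edge wherever s(a,b) > tau."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > tau:
            parent[find(i)] = find(j)      # union the two components

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def assign_folds(clusters, k):
    """Greedy: place each cluster, largest first, into the smallest fold."""
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds

def max_cross_fold_identity(seqs, folds):
    """Highest identity between any two sequences in different folds."""
    best = 0.0
    for f, g in combinations(folds, 2):
        for i in f:
            for j in g:
                best = max(best, identity(seqs[i], seqs[j]))
    return best

seqs = ["MKVLAAGG", "MKVLAAGA", "MKVLASGG",   # family A
        "TTPQWERY", "TTPQWERF",               # family B
        "GHGHCCNN"]                           # singleton
tau = 0.5
clusters = single_linkage_clusters(seqs, tau)
folds = assign_folds(clusters, k=2)
assert max_cross_fold_identity(seqs, folds) <= tau  # constraint (2) holds
```

Single linkage is transitive: if s(a,b) > τ and s(b,c) > τ, then a and c share a cluster even when s(a,c) ≤ τ. That chaining is what makes the cross-fold bound hold for every pair, at the cost of larger clusters at low τ.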

## Reporting: a curve, not a scalar

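Instead of one accuracy, GRIMM reports performance as a function of the identity gap δ = 1 − max-identity-to-train, computed per test sequence against its nearest training neighbor. The generalization profile R(δ) exposes exactly where a model stops working. The table is indexed by nearest-train identity, that is, by 1 − δ: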

| Nearest-train identity | What it probes | Typical behavior |
|---|---|---|
| > 0.9 | near-duplicate recall | optimistic, what random splits report |
| 0.5 – 0.7 | within-family transfer | graceful degradation if features generalize |
| < 0.3 | cross-clade generalization | collapse toward prior under low diversity |

The shape of R(δ) separates two models that look identical on a random split: one whose accuracy holds as δ grows is learning transferable biochemistry; one whose accuracy falls off a cliff at low identity is exploiting homology. A single random-split number hides the difference.
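
One way to compute such a profile is to bin test sequences by their identity gap and report accuracy per bin. The sketch below is illustrative only: the function name, the bin edges, and the example data are assumptions, not values from this note.

```python
from collections import defaultdict

def generalization_profile(deltas, correct, edges=(0.1, 0.3, 0.5, 0.7)):
    """Accuracy per identity-gap bin.

    deltas  : per-test-sequence gap, delta = 1 - max identity to train
    correct : per-test-sequence bool, whether the prediction was right
    edges   : upper bin boundaries (a hypothetical choice); gaps above the
              last edge fall into a final open-ended bin
    Only bins that contain at least one test sequence are returned.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for d, ok in zip(deltas, correct):
        bin_index = sum(d > e for e in edges)  # count of edges strictly below d
        totals[bin_index] += 1
        hits[bin_index] += int(ok)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Invented example: accurate near the training set, failing far from it.
deltas  = [0.05, 0.08, 0.25, 0.28, 0.45, 0.48, 0.75, 0.80]
correct = [True, True, True, True, True, False, False, False]
profile = generalization_profile(deltas, correct)
print(profile)
```

Empty bins are dropped rather than reported as zero, so a gap in the profile signals missing test coverage at that distance rather than model failure.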

## Diversity as the binding constraint

For enzyme-function prediction the limiting resource is sequence diversity, not label count. A dataset can be large and still cover a thin slice of sequence space. GRIMM makes that legible: the number of clusters that remain at each τ is a direct read on effective diversity, and the decay of R(δ) bounds the radius in sequence space over which the model is trustworthy.

  • The effective sample count for any downstream variance argument is the cluster count, not the sequence count. This is the same correction that enters the aspect ratio in the companion phase-transition note.
  • Stratified folds keep label distributions matched so a difficulty sweep does not confound identity with class imbalance.
  • Clustering is single-pass and cached; the marginal cost of adding a τ to the sweep is one evaluation, not one retrain.
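
The first point above can be checked directly by counting clusters across the τ sweep. The sketch below is a toy under stated assumptions: `cluster_count` and `identity` are hypothetical helpers, identity is a position-wise match fraction, and the sequences are invented, with one deliberate "bridge" sequence that is half family A and half family B.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (assumes equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_count(seqs, tau):
    """Number of single-linkage clusters at tau (graph connected components)."""
    n = len(seqs)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if identity(seqs[i], seqs[j]) > tau:
            neighbors[i].add(j)
            neighbors[j].add(i)
    seen, count = set(), 0
    for start in range(n):
        if start in seen:
            continue
        count += 1                      # new component found
        stack = [start]
        while stack:                    # depth-first flood fill
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node] - seen)
    return count

seqs = ["MKVLAAGG", "MKVLAAGA", "MKVLASGG", "MKILAAGG",  # family A
        "TTPQWERY", "TTPQWERF",                          # family B
        "MKVLWERY",                                      # bridge: half A, half B
        "GHGHCCNN"]                                      # singleton
counts = {tau: cluster_count(seqs, tau) for tau in (0.9, 0.7, 0.5, 0.3)}
print(counts)
```

On this toy set the eight sequences collapse to four clusters at τ = 0.7 and to two at τ = 0.3, where the bridge sequence chains both families together. The cluster count, not the sequence count of eight, is the number of independent units available to any fold assignment at that threshold.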

Co-authored methods note. Clustering via MMseqs2-style identity thresholds.
Liam Kozma · liam.kozma@protonmail.com