The recovery threshold under distribution shift

Abstract

A three-layer classifier over 1280-D ESM-2 embeddings is trained on one protein population and adapted, batch by batch, toward a shifted one; the question is how much target data the adaptation pool must hold before the model overcomes negative transfer and realigns to the target manifold. The original apparatus draws both populations from one Gaussian mixture labeled by a single frozen network, which fixes P(y | x) everywhere and nests the source support inside the target's: adaptation is monotonic and the dip never appears. Replacing the simulator with real ESM-2 embeddings of UniProt proteins under a Bacteria→Archaea shift, with Pfam family labels shared across both, produces the negative-transfer dip for the first time. Across a pool-composition sweep, target F1 falls below its pre-adaptation baseline at low out-of-distribution fraction and recovers only once that fraction reaches roughly one half. The recovery threshold is r ≈ 0.5.

keywords: covariate shift · negative transfer · protein language models · recovery threshold

## A simulator that cannot dip

Labels come from a frozen, randomly initialized network: a RandomOracleNN maps a 1280-D embedding to a class by argmax over its logits. The same oracle labels source and target, so P(y | x) is one fixed function over the entire embedding space. Embeddings are Gaussian-mixture draws around shared family centroids, and the shift multiplier k changes only the sampling dispersion around those centroids:

σ_src = σ₀ / max(1, k) ≤ σ₀ = σ_tgt (1)

Here σ₀ is the base dispersion and k ≥ 1 (the max(1, k) guard leaves σ_src = σ₀ for any smaller k), so the source is sampled at least as tightly as the target around identical centroids. The source's high-density region is a lower-variance subset of the target's, labeled by the same function. A source-trained classifier is therefore already approximately correct on the target, and adaptation only adds information: improvement is monotonic and the dip the experiment was built to measure cannot occur. Raising k tightens the source further but never displaces it; no synthetic shift induces a conflicting optimum. The null result is a property of the construction, not the data.
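The construction in Eq. (1) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: the width is reduced from 1280 to 8 for speed, a frozen random linear map stands in for the RandomOracleNN, and `sample_population`, `oracle_label`, `SIGMA0`, and the family count are hypothetical names and values.

```python
import random
import statistics

DIM = 8          # reduced from 1280 for speed (assumption)
N_FAMILIES = 4   # hypothetical; the simulator's family count is not stated
SIGMA0 = 1.0     # base dispersion sigma_0 (value is an assumption)

rng = random.Random(0)
# Shared family centroids: both populations are sampled around the same points.
centroids = [[rng.gauss(0.0, 3.0) for _ in range(DIM)] for _ in range(N_FAMILIES)]
# Frozen random linear map standing in for the oracle network's logits.
oracle_w = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_FAMILIES)]


def sigma_src(k, sigma0=SIGMA0):
    """Eq. (1): source dispersion shrinks with the shift multiplier k."""
    return sigma0 / max(1.0, k)


def sample_population(n, sigma, rng):
    """Draw n points as Gaussian noise of scale sigma around random centroids."""
    points = []
    for _ in range(n):
        c = rng.choice(centroids)
        points.append([ci + rng.gauss(0.0, sigma) for ci in c])
    return points


def oracle_label(x):
    """Same frozen labeler for source and target: argmax over linear logits."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in oracle_w]
    return max(range(len(logits)), key=logits.__getitem__)


def mean_distance_to_nearest_centroid(points):
    """Average Euclidean distance from each point to its closest centroid."""
    dists = []
    for p in points:
        dists.append(
            min(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centroids)
        )
    return statistics.mean(dists)


k = 4.0
source = sample_population(500, sigma_src(k), rng)
target = sample_population(500, SIGMA0, rng)
source_labels = [oracle_label(x) for x in source]
target_labels = [oracle_label(x) for x in target]
spread_src = mean_distance_to_nearest_centroid(source)
spread_tgt = mean_distance_to_nearest_centroid(target)
```

In this sketch, raising `k` only shrinks `sigma_src`; the centroids and the labeler never change, which is why the source stays nested inside the target and no conflicting optimum can arise.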

## Real embeddings under a taxonomic shift

Negative transfer needs the source-optimal boundary to be actively wrong on the target, which requires either a label function that differs across populations or supports that are partially disjoint. Both follow from using real data. Sequences are pulled from UniProt with Pfam family as the label and taxonomic domain as the shift axis: source is Bacteria, target is Archaea, drawn from the same 16 families. Each sequence is embedded with ESM-2 (esm2_t33_650M_UR50D), mean-pooled over residues to 1280-D. The width matches the synthetic embeddings, so model.py, train.py, and the adaptation loop run unchanged.
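Mean pooling turns a variable-length run of per-residue vectors into one fixed-width embedding. The sketch below uses plain Python on toy data. Whether special tokens (such as the begin and end markers a protein language model adds) are excluded before averaging is an assumption here, since the text does not say, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, n_residues, skip_prefix=1):
    """Average per-residue vectors into one fixed-width embedding.

    token_vectors: list of equal-length float lists, one per token.
    n_residues:    number of real residues in the sequence.
    skip_prefix:   leading special tokens to drop (assumed 1, a begin marker).
    Trailing tokens beyond the residues (end marker, padding) are ignored.
    """
    residues = token_vectors[skip_prefix:skip_prefix + n_residues]
    if not residues:
        raise ValueError("sequence has no residues to pool")
    width = len(residues[0])
    return [sum(v[d] for v in residues) / len(residues) for d in range(width)]


# Toy example: 3 residues of width 2, wrapped in begin/end markers plus one pad.
tokens = [
    [9.0, 9.0],  # begin marker (dropped)
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [9.0, 9.0],  # end marker (dropped)
    [0.0, 0.0],  # padding (dropped)
]
embedding = mean_pool(tokens, n_residues=3)
```

With real per-residue representations of width 1280, the same averaging yields the 1280-D vector the classifier consumes, independent of sequence length.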

The source model trains on Bacteria alone. The adaptation pool is a mixture: fraction r from Archaea, the remainder from held-out Bacteria, with pool size held fixed across r. The test set is pure Archaea, the manifold the model is asked to reach. Adaptation streams the pool in batches of 32 under Adam (lr = 1×10⁻³) and evaluates after every step. The synthetic default of one evaluation per 500 batches would have hidden any transient dip; evaluating per step samples it.
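The pool composition described above can be sketched as follows. This is a minimal illustration, not the project's code: `build_pool` and `batches` are hypothetical helpers, the pool size of 320 is made up (the text does not state it), and tagged integers stand in for embeddings.

```python
import random


def build_pool(target_items, source_items, r, pool_size, seed=0):
    """Mix a fixed-size adaptation pool: fraction r target, remainder source."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    n_target = round(r * pool_size)
    n_source = pool_size - n_target
    if n_target > len(target_items) or n_source > len(source_items):
        raise ValueError("not enough items to fill the pool at this r")
    rng = random.Random(seed)
    pool = rng.sample(target_items, n_target) + rng.sample(source_items, n_source)
    rng.shuffle(pool)
    return pool


def batches(pool, batch_size=32):
    """Yield consecutive batches; the last one may be short."""
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


# Toy stand-ins: tagged integers instead of embeddings.
archaea = [("archaea", i) for i in range(400)]
heldout_bacteria = [("bacteria", i) for i in range(400)]

pool = build_pool(archaea, heldout_bacteria, r=0.25, pool_size=320, seed=1)
n_ood = sum(1 for domain, _ in pool if domain == "archaea")
n_batches = sum(1 for _ in batches(pool))
```

Because `pool_size` is fixed and only `r` varies, any difference between sweep points comes from composition rather than from the amount of adaptation data.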

## The recovery threshold

Sweeping r over {0, 0.1, 0.25, 0.5, 1.0} at three seeds, against a baseline target F1 of 0.896:

| OOD fraction r | Dip depth (mean ± std) | Seeds recovered | Final target F1 |
|---|---|---|---|
| 0.00 | 0.087 ± 0.018 | 0/3 | 0.836 |
| 0.10 | 0.189 ± 0.071 | 0/3 | 0.810 |
| 0.25 | 0.107 ± 0.030 | 1/3 | 0.880 |
| 0.50 | 0.113 ± 0.040 | 2/3 | 0.904 |
| 1.00 | 0.053 ± 0.013 | 3/3 | 0.942 |

A pool of pure in-distribution data still perturbs target F1 by about 0.09, which is the noise floor of small-batch stochastic optimization on a small pool. At r = 0.10 the dip is 0.189 ± 0.071, cleanly above that floor: adapting on a little out-of-distribution data makes target performance worse, and at this fraction no seed recovers. This is the negative transfer the synthetic construction forbids. Recovery is the cleaner signal. The fraction of seeds that climb back to baseline goes 0, 0, 1/3, 2/3, 3/3 as r rises. Final target F1 is not strictly monotonic: it drops from 0.836 at r = 0 to 0.810 at r = 0.10, then climbs steadily to 0.942 at r = 1.0. Below r ≈ 0.5 the model tends to stay stuck in negative transfer; at or above it, most seeds realign (2/3 at r = 0.5, 3/3 at r = 1.0). Pools are equal size across all r, so the effect is composition, not volume.
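The quantities in the sweep can be made concrete with a short sketch. The definitions here are assumptions, since the text does not spell them out: dip depth is taken as the baseline minus the lowest per-step F1, a run counts as recovered if it ends at or above baseline, and the threshold is the smallest swept r at which at least half the seeds recover. The function names and the toy trace are illustrative only.

```python
def dip_depth(f1_trace, baseline):
    """Largest drop of target F1 below the pre-adaptation baseline (>= 0)."""
    return max(0.0, baseline - min(f1_trace))


def recovered(f1_trace, baseline):
    """Assumed criterion: the run ends at or above the baseline."""
    return f1_trace[-1] >= baseline


def recovery_threshold(recovery_by_r, min_fraction=0.5):
    """Smallest swept r whose recovered fraction reaches min_fraction."""
    for r in sorted(recovery_by_r):
        n_recovered, n_seeds = recovery_by_r[r]
        if n_recovered / n_seeds >= min_fraction:
            return r
    return None


# Made-up per-step trace for illustration only (not measured data).
BASELINE = 0.896
toy_trace = [0.896, 0.84, 0.79, 0.83, 0.88, 0.90]
toy_dip = dip_depth(toy_trace, BASELINE)
toy_recovered = recovered(toy_trace, BASELINE)

# Recovered-seed counts as reported in the sweep table: r -> (recovered, seeds).
reported = {0.0: (0, 3), 0.10: (0, 3), 0.25: (1, 3), 0.50: (2, 3), 1.0: (3, 3)}
threshold = recovery_threshold(reported)
```

Applied to the reported counts, this majority rule returns 0.5, the first swept fraction at which most seeds recover. With only five sweep points, the true crossover could lie anywhere between 0.25 and 0.5.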

## Limits

Only the r = 0.10 dip is statistically clean above the noise floor at three seeds; the intermediate dips sit inside the noise band, and the recovery trend carries more of the evidence than dip depth. The result is one configuration: a single 650M-parameter PLM, mean-pooled, one Bacteria→Archaea shift, 16 families, one optimizer. The baseline sits at about 0.90 (0.896), leaving limited headroom. A more distant taxonomic shift might deepen the dip and sharpen the threshold, but that remains untested here.
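One informal way to check which dips clear the noise floor is a one-standard-deviation band comparison against the r = 0 row. This criterion is an assumption made for illustration, not a hypothesis test and not necessarily the one used for the claim above; `clears_floor` is a hypothetical helper.

```python
# Dip depth (mean, std) per OOD fraction r, copied from the sweep table.
DIPS = {
    0.00: (0.087, 0.018),
    0.10: (0.189, 0.071),
    0.25: (0.107, 0.030),
    0.50: (0.113, 0.040),
    1.00: (0.053, 0.013),
}


def clears_floor(dips, floor_r=0.0):
    """Flag each r whose (mean - std) exceeds the floor's (mean + std).

    This one-sigma band comparison is an informal check, not a hypothesis test.
    """
    floor_mean, floor_std = dips[floor_r]
    floor_upper = floor_mean + floor_std
    return {
        r: (mean - std) > floor_upper
        for r, (mean, std) in dips.items()
        if r != floor_r
    }


result = clears_floor(DIPS)
```

Under this criterion only r = 0.10 clears the floor (0.118 against an upper floor edge of 0.105), and by a narrow margin, which is a reason to run more seeds before leaning on dip depth.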

Python, PyTorch, ESM-2 (fair-esm), NumPy, SciPy. Nextflow DAG over SLURM; NVIDIA A100 GPUs on the UGA Sapelo2 cluster. UniProt sequences, Pfam labels.
Liam Kozma · liam@liamkozma.com