Research Note · Protein Language Models

Recovery Thresholds and Phase Transitions in Protein Language Models

Abstract

Frozen protein language model (PLM) embeddings carry biochemical structure as a small number of high-variance directions buried in near-isotropic representational noise. Treating the per-token feature covariance as a spiked model, recovery of a biochemical direction is not gradual: it is a threshold phenomenon governed by the Baik–Ben Arous–Péché (BBP) transition. Below a critical signal-to-noise ratio set by the embedding dimension and the effective sample count, the leading sample eigenvector is asymptotically orthogonal to the true signal and contributes nothing to an inverse folding decoder; above it, alignment turns on continuously from zero. This note frames the transition, derives the detectability boundary, and argues that compute-optimal scaling of inverse folding is the problem of keeping every biochemical spike on the supercritical side of its threshold.

keywords: spiked covariance · phase transition · eigenvector overlap · inverse folding · compute-optimal scaling

## Setup

Fix a frozen encoder producing residue embeddings in $\mathbb{R}^p$. For a fixed structural context, collect the embeddings whose label is the residue identity to be recovered. Center them and model the population covariance as isotropic noise plus a low-rank biochemical signal:

$$\Sigma = I_p + \sum_{k=1}^{r} \beta_k \, v_k v_k^{\top} \tag{1}$$

Each $v_k$ is a unit biochemical direction (hydrophobicity, charge, a rotamer-discriminating axis) and $\beta_k$ its variance over the noise floor. With $n$ effective samples the relevant asymptotic regime is the high-dimensional one, $p, n \to \infty$ with aspect ratio

$$\gamma = p / n, \quad \text{fixed}. \tag{2}$$

The word effective is doing work here. Homologous sequences are not independent draws; the count that enters γ is the number of decorrelated observations, which sits far below the raw token count and is exactly what sequence clustering controls (see the companion note on GRIMM).
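
To make the model in (1) concrete, here is a minimal sketch that draws synthetic "embeddings" from a spiked covariance. It is an illustration under stated assumptions, not part of the estimation pipeline: the function name `sample_spiked`, the Gaussian noise, and the values of `n`, `p` and the spike strengths are all hypothetical choices, and the samples are independent by construction, which real homologous sequences are not.

```python
import numpy as np

def sample_spiked(n, p, betas, seed=0):
    """Draw n samples in R^p with covariance I_p + sum_k beta_k v_k v_k^T."""
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    r = betas.size
    # Orthonormal planted directions: Q factor of a random Gaussian matrix.
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))        # shape [p, r]
    noise = rng.standard_normal((n, p))                      # isotropic unit-variance part
    scores = rng.standard_normal((n, r)) * np.sqrt(betas)    # amplitude along each spike
    Z = noise + scores @ V.T
    return Z, V

# Hypothetical sizes: one strong spike and one weak spike.
Z, V = sample_spiked(n=2000, p=500, betas=[4.0, 0.3])
gamma = Z.shape[1] / Z.shape[0]    # aspect ratio p / n of Eq. (2)
```

Here `gamma` is 0.25, so of the two planted spikes only the first sits above the square root of the aspect ratio.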

## The detectability threshold

Under noise alone the sample eigenvalues follow the Marchenko–Pastur law, supported up to a hard upper edge at $(1+\sqrt{\gamma})^2$. A planted spike $\beta_k$ escapes this bulk, and so becomes visible, only when it clears the BBP threshold:

$$\beta_k > \sqrt{\gamma} \iff \text{detectable}. \tag{3}$$
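
A short worked step shows why the threshold and the bulk edge are the same boundary. It relies on the standard spiked-covariance expression for where a supercritical spike places its sample eigenvalue, which is the relation inverted in the estimation protocol of this note; the numerical values at the end are illustrative only.

```latex
\begin{align*}
\lambda(\beta_k) &= (1+\beta_k)\Bigl(1+\frac{\gamma}{\beta_k}\Bigr),
  \qquad \beta_k > \sqrt{\gamma}, \\
\lambda(\sqrt{\gamma}) &= (1+\sqrt{\gamma})\Bigl(1+\frac{\gamma}{\sqrt{\gamma}}\Bigr)
  = (1+\sqrt{\gamma})^2, \\
\frac{d\lambda}{d\beta_k} &= 1 - \frac{\gamma}{\beta_k^2} > 0
  \qquad \text{for } \beta_k > \sqrt{\gamma}.
\end{align*}
% Example: gamma = 0.25, beta_k = 1 gives lambda = 2 * 1.25 = 2.5,
% above the bulk edge (1 + 0.5)^2 = 2.25.
```

At the threshold the outlier eigenvalue coincides with the Marchenko–Pastur edge, and it moves strictly outward as the spike strengthens.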

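The consequence that matters for a decoder is not the eigenvalue but the eigenvector overlap: how much of the recovered direction is real signal versus rotated noise. Writing $\hat{v}_k$ for the sample eigenvector associated with spike $k$, the squared overlap in the spiked model has a closed form with a hard phase transition at the same boundary,

$$|\langle \hat{v}_k, v_k \rangle|^2 = \frac{1 - \gamma/\beta_k^2}{1 + \gamma/\beta_k} \quad \text{for } \beta_k > \sqrt{\gamma}, \quad \text{else } 0. \tag{4}$$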

Below threshold the alignment is asymptotically zero: the top empirical component is a unit vector pointing in a direction the data cannot distinguish from noise. A linear probe or a decoder head reading that component recovers nothing, regardless of how it is trained. Above threshold the overlap rises continuously from zero, so the recoverable fraction of a biochemical feature is a smooth function of how far the spike sits past $\sqrt{\gamma}$. Recovery turns on; it does not fade in.

  overlap^2
   1.0 |                               . - - - ----
       |                        . '
       |                   . '
       |                 .
       |               .
   0.0 |____________ . _______________________________
       0         sqrt(gamma)                  beta_k ->
Fig. 1   Squared eigenvector overlap against spike strength. Identically zero below the BBP threshold; continuous, concave rise above it.
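
The curve in Fig. 1 can be checked numerically. The sketch below evaluates the closed form (4) and compares it with the overlap measured on simulated Gaussian data with a single planted spike. The function names, the choice of a coordinate axis as the planted direction, and the sizes `n` and `p` are assumptions made for illustration; at finite size the measured overlap only approximates the asymptotic value.

```python
import numpy as np

def squared_overlap(beta, gamma):
    """Asymptotic squared overlap of Eq. (4); zero at or below sqrt(gamma)."""
    beta = np.asarray(beta, dtype=float)
    above = beta > np.sqrt(gamma)
    safe = np.where(above, beta, 1.0)            # placeholder value where the formula is unused
    val = (1 - gamma / safe**2) / (1 + gamma / safe)
    return np.where(above, val, 0.0)

def empirical_overlap(beta, n, p, seed=0):
    """Squared overlap between the top sample eigenvector and one planted spike."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0                                   # planted direction: first coordinate axis
    Z = rng.standard_normal((n, p))
    Z[:, 0] *= np.sqrt(1 + beta)                 # variance 1 + beta along v
    S = Z.T @ Z / n                              # sample covariance (population mean is zero)
    _, vecs = np.linalg.eigh(S)                  # eigenvalues in ascending order
    return float((vecs[:, -1] @ v) ** 2)

gamma = 0.25                                     # threshold sqrt(gamma) = 0.5
for beta in (0.2, 1.0, 2.0):
    theory = float(squared_overlap(beta, gamma))
    measured = empirical_overlap(beta, n=1600, p=400)
    print(f"beta={beta}: theory={theory:.3f}, measured={measured:.3f}")
```

For `beta = 1.0` and `gamma = 0.25` the closed form gives 0.75 / 1.25 = 0.6; for `beta = 0.2` it is exactly zero, since the spike is below threshold.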

## From threshold to inverse folding

Inverse folding asks for a residue distribution given a structural context. Cast the decoder as a readout of the supercritical subspace: the directions that survive (3) are the only ones carrying recoverable biochemical contrast, and the achievable cross-entropy is bounded by how much label-relevant variance those directions retain after the overlap penalty in (4). Two regimes follow directly.

  • Subcritical features. A biochemical axis with $\beta_k \le \sqrt{\gamma}$ is invisible at the current scale. No decoder capacity, prompt, or fine-tune recovers it, because the information is not present in the empirical spectrum: it is below the bulk edge.
  • Supercritical features. Once a spike clears threshold its contribution to recovery grows with overlap, then saturates. Marginal returns on that feature shrink as $\beta_k \gg \sqrt{\gamma}$.

The practical lever is $\gamma$. Lowering the aspect ratio (more effective samples, or a lower intrinsic embedding dimension) drags the threshold down and pulls features across it. This reframes a class of inverse-folding gains: they are not the decoder getting smarter, they are biochemical spikes crossing $\sqrt{\gamma}$.
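
Since the threshold is the square root of p / n, the condition for a spike of strength beta to be supercritical rearranges to n > p / beta². The sketch below turns that into a sample-count requirement. The helper names and the embedding dimension are hypothetical, and the count is in effective (decorrelated) samples, not raw tokens.

```python
import math

def is_supercritical(p, n, beta):
    """True when beta clears the BBP threshold sqrt(p / n)."""
    return beta > math.sqrt(p / n)

def min_effective_samples(p, beta):
    """Smallest integer n with beta > sqrt(p / n), i.e. n > p / beta**2."""
    return math.floor(p / beta**2) + 1

p = 1280                                  # hypothetical embedding dimension
for beta in (2.0, 0.5, 0.25):
    n_min = min_effective_samples(p, beta)
    print(f"beta={beta}: needs n >= {n_min}")
```

With `p = 1280`, a spike of strength 0.5 needs more than 5120 effective samples. Halving the spike strength quadruples the requirement, which is why weak biochemical axes are the expensive ones to pull across the boundary.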

## A compute-optimal reading

Compute is split between making embeddings informative (raising effective $\beta_k$) and making the threshold low (raising effective $n$ relative to $p$). The criterion is to allocate so that the biochemical features the task depends on each sit above their BBP boundary by a margin, not to push a single quantity to extremes.

$$\text{maximize} \quad \sum_k w_k \cdot \mathrm{overlap}^2(\beta_k, \gamma) \quad \text{s.t.} \quad \mathrm{cost}(\beta, n, p) \le C \tag{5}$$

with $w_k$ the task weight on feature $k$. Because each overlap term is flat-zero below threshold and concave above, the optimum spends first on whichever subcritical feature is cheapest to lift across $\sqrt{\gamma}$, and stops over-investing in features already deep in the supercritical regime. The estimation protocol is deliberately spectral and decoder-free:

# estimate per-feature recoverability without training a decoder
# encoder, effective_n and recover_beta are placeholders, not library calls
import numpy as np

Z      = encoder(seqs)            # [n, p] frozen residue embeddings
Z      = Z - Z.mean(axis=0)       # center each feature
p      = Z.shape[1]               # embedding dimension
gamma  = p / effective_n(seqs)    # cluster-corrected sample count
edge   = (1 + np.sqrt(gamma))**2  # Marchenko-Pastur upper edge (unit noise variance)

lam    = np.linalg.eigvalsh(np.cov(Z, rowvar=False))  # sample covariance spectrum
spikes = lam[lam > edge]          # supercritical components only

# invert lambda = (1+beta)(1+gamma/beta) for each detached eigenvalue
beta   = recover_beta(spikes, gamma)
ov2    = (1 - gamma/beta**2) / (1 + gamma/beta)   # Eq. (4)

The estimate is cheap, reusable across decoders, and tells you in advance which biochemical directions a given encoder and dataset can support, before a single training run.
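
The protocol can be exercised end to end on synthetic data. The sketch below is one possible implementation of the inversion step, obtained by solving the quadratic beta² + (1 + gamma - lambda)·beta + gamma = 0 and keeping the root above the threshold. It substitutes simulated unit-variance Gaussian noise with one planted spike for real encoder output, so the sizes and the spike strength are assumptions, and it uses the raw sample count where a real run would use the cluster-corrected one.

```python
import numpy as np

def recover_beta(lam, gamma):
    """Invert lam = (1 + beta)(1 + gamma / beta), keeping the root above sqrt(gamma)."""
    lam = np.asarray(lam, dtype=float)
    b = lam - 1.0 - gamma
    disc = np.clip(b**2 - 4.0 * gamma, 0.0, None)   # zero exactly at the bulk edge
    return 0.5 * (b + np.sqrt(disc))

# Synthetic stand-in for frozen embeddings: one spike along the first axis.
rng = np.random.default_rng(0)
n, p, beta_true = 2000, 500, 2.0
Z = rng.standard_normal((n, p))
Z[:, 0] *= np.sqrt(1.0 + beta_true)
Z -= Z.mean(axis=0)

gamma = p / n                                        # raw n here; real data needs effective n
edge = (1 + np.sqrt(gamma))**2                       # Marchenko-Pastur upper edge
lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))    # ascending sample eigenvalues
spikes = lam[lam > edge]                             # detached eigenvalues

beta_hat = recover_beta(spikes, gamma)
ov2 = (1 - gamma / beta_hat**2) / (1 + gamma / beta_hat)   # Eq. (4)
print("estimated beta of top spike:", beta_hat[-1])
print("estimated squared overlap:  ", ov2[-1])
```

At finite size the largest noise eigenvalue can land slightly above the asymptotic edge, so `spikes` may contain a spurious entry with an estimated strength close to the threshold; the last entry is the dominant spike.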

## Caveats

  • The spiked model assumes isotropic residual noise. Real PLM embeddings have structured, anisotropic noise; the bulk edge and threshold shift but the existence of a transition does not.
  • Spike strengths are estimated, and estimation has its own variance near the boundary. Features sitting within sampling error of $\sqrt{\gamma}$ should be treated as undetermined, not absent.
  • Overlap bounds recoverable signal; it does not guarantee a decoder reaches the bound. It is a ceiling, not a forecast.

Working note. Notation follows the standard spiked-covariance and BBP literature.
Liam Kozma · liam.kozma@protonmail.com