Liam Kozma Statistics · Machine Learning · Athens, GA

Liam Kozma

Statistical and machine-learning systems for protein data, high-dimensional inference, and bare-metal HPC.

M.S. Statistics  ·  B.S. Biochemical Engineering

liamkozmabiz@gmail.com · LinkedIn · GitHub · Scholar · CV (PDF)

Abstract

I build and stress-test statistical and machine-learning systems where the data does not stay still. My thesis defines a recovery threshold for protein language models under distribution shift: the largest geometric displacement at which a frozen embedding model still resolves structure. The same instinct (measure the point of failure rather than the headline accuracy) carries into autonomous multi-agent systems, quantum machine learning, and production data pipelines run on high-performance compute. I work close to the metal: HPC scheduling on Sapelo2, local inference serving, and a daily Arch Linux environment.

Keywords:  distribution shift · protein language models · high-dimensional statistics · multi-agent systems · quantum machine learning · HPC (Sapelo2) · data pipelines

§ 1

Recovery threshold for protein language models under distribution shift

M.S. thesis, treated as a diagnostic rather than a benchmark. Given a held-out protein set drawn off-distribution, locate the point at which a frozen embedding model stops resolving structure. The deliverable is a threshold, not a leaderboard position.

Shift is parameterized by $\delta$, the absolute geometric displacement between the in-distribution and out-of-distribution embedding clouds. Recovery is read across $N = 36$ unaggregated runs, which keeps the variance structure intact for the threshold fit rather than washing it out in a mean.
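
A minimal sketch of one way to measure and control such a displacement, assuming $\delta$ is the Euclidean distance between the centroids of the two embedding clouds. That definition, the function names `centroid_shift` and `displace_to`, and the random arrays standing in for embeddings are illustrative assumptions, not the thesis code.

```python
import numpy as np

def centroid_shift(id_emb: np.ndarray, ood_emb: np.ndarray) -> float:
    """Euclidean distance between cloud centroids (assumed definition of delta)."""
    return float(np.linalg.norm(ood_emb.mean(axis=0) - id_emb.mean(axis=0)))

def displace_to(id_emb, ood_emb, direction, target_delta):
    """Translate the OOD cloud so its centroid sits target_delta away from
    the ID centroid along `direction`. The cloud's shape is left untouched."""
    unit = direction / np.linalg.norm(direction)
    new_centroid = id_emb.mean(axis=0) + target_delta * unit
    return ood_emb + (new_centroid - ood_emb.mean(axis=0))

# Random stand-ins for pooled protein embeddings (hypothetical data).
rng = np.random.default_rng(0)
id_emb = rng.normal(size=(200, 8))
ood_emb = rng.normal(loc=0.5, size=(150, 8))

shifted = displace_to(id_emb, ood_emb, direction=np.ones(8), target_delta=2.0)
delta = centroid_shift(id_emb, shifted)
```

Because the OOD cloud is only translated, its internal spread is unchanged and the measured `delta` matches the requested displacement up to floating-point error.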

The recovery threshold $\delta^{\star}$ is the largest shift at which mean reconstruction fidelity stays above a fixed floor $\alpha$. Below $\delta^{\star}$ the model recovers; beyond it the geometry collapses.

Embeddings are drawn from a frozen transformer protein language model (ESM-2 class). The diagnostic measures the pretrained geometry itself, not a fine-tuned downstream head.

Method & estimator · how the threshold is fit from N = 36 runs

Fidelity $\mathcal{R}(\delta)$ is estimated separately for each of the $N=36$ runs, and the threshold $\delta^{\star}$ is read as the supremum of shifts clearing $\alpha$. Keeping runs unaggregated preserves the run-to-run variance that the threshold estimate depends on, and exposes where the collapse is sharp versus gradual. The shift axis is constructed by displacing the OOD mean along controlled directions, so $\delta$ is a measured geometric quantity rather than a proxy label.
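
A minimal sketch of how such a threshold could be read off a grid of shifts, assuming fidelity is stored as a runs-by-shifts array and the shifts are sorted in increasing order. The function name `recovery_threshold`, the synthetic sigmoid curve, the noise level, and the value of `alpha` are all illustrative assumptions.

```python
import numpy as np

def recovery_threshold(shifts, fidelity, alpha):
    """Largest grid shift such that mean fidelity is >= alpha there and at
    every smaller shift. Returns None if the smallest shift already fails.

    shifts:   (n_shifts,) increasing
    fidelity: (n_runs, n_shifts), one row per run
    """
    shifts = np.asarray(shifts, dtype=float)
    mean_fid = np.asarray(fidelity, dtype=float).mean(axis=0)
    clears = mean_fid >= alpha
    if not clears[0]:
        return None
    # argmin on a boolean array gives the index of the first failure.
    n_ok = len(clears) if clears.all() else int(np.argmin(clears))
    return float(shifts[n_ok - 1])

# Synthetic fidelity curves for 36 runs (hypothetical data).
rng = np.random.default_rng(0)
shifts = np.linspace(0.0, 3.0, 13)
true_curve = 1.0 / (1.0 + np.exp(4.0 * (shifts - 1.5)))
fidelity = true_curve + rng.normal(scale=0.03, size=(36, shifts.size))

alpha = 0.8
delta_star = recovery_threshold(shifts, fidelity, alpha)

# Per-run thresholds show the spread that a pooled mean would hide.
per_run = [recovery_threshold(shifts, row[None, :], alpha) for row in fidelity]
```

Applying the same function to single rows gives one threshold per run, which is one way to see whether the collapse is sharp (thresholds agree) or gradual (thresholds scatter).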

Fig. 1. Sequence → per-residue embeddings → positional encoding → mean-pooled protein embedding. This is the same representation on which $\delta$ is measured.
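
The pooling step in Fig. 1 can be sketched as a masked mean over residues, assuming per-residue embeddings arrive as a padded batch with a 0/1 mask. The array shapes and the name `mean_pool` are assumptions for illustration; the model call that produces the per-residue embeddings is not shown.

```python
import numpy as np

def mean_pool(residue_emb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average per-residue embeddings over real residues only.

    residue_emb: (batch, length, dim) padded per-residue embeddings
    mask:        (batch, length), 1 for real residues and 0 for padding
    returns:     (batch, dim) one embedding per protein
    """
    weights = mask[..., None].astype(residue_emb.dtype)
    summed = (residue_emb * weights).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(weights.sum(axis=1), 1.0, None)
    return summed / counts

# Two toy "proteins": the first has 2 real residues plus 1 padding slot.
residue_emb = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(residue_emb, mask)
```

Masking matters because padded positions would otherwise pull the pooled vector toward whatever values the padding slots hold.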
§ 2

Quantum machine learning

Self-directed research initiative, run independently of any course or lab. The question is narrow and honest: where do parameterized quantum circuits actually buy representational capacity over a classical model of matched parameter count, and where is the apparent advantage an artifact of the encoding? I treat it the same way as the thesis: as a measurement problem rather than a demo.

Scope of the initiative · autonomous · ongoing

variational circuits · data re-uploading · classical-shadow readout

The working program: build small variational classifiers, control for encoding so comparisons are fair, and characterize trainability against circuit depth. The output I care about is a clear statement of regime (where the quantum model holds its own and where it does not), not a single accuracy number on a curated task.
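
A minimal sketch of the data re-uploading idea on a single simulated qubit, written in plain NumPy rather than a quantum SDK. The layer structure (an RY rotation whose angle depends on the input, followed by a trainable RZ), the parameter layout, and the name `reupload_expectation` are illustrative assumptions, not the circuits used in this initiative.

```python
import numpy as np

def ry(angle: float) -> np.ndarray:
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(angle: float) -> np.ndarray:
    """Single-qubit rotation about the Z axis."""
    return np.diag([np.exp(-0.5j * angle), np.exp(0.5j * angle)])

def reupload_expectation(x: float, params) -> float:
    """Expectation of Pauli-Z after re-uploading the scalar input x in every layer.

    params: iterable of (w, b, phi) rows; each layer applies RZ(phi) RY(w*x + b).
    """
    state = np.array([1.0, 0.0], dtype=complex)  # start in |0>
    for w, b, phi in params:
        state = rz(phi) @ ry(w * x + b) @ state
    return float(abs(state[0]) ** 2 - abs(state[1]) ** 2)

# Two layers with hypothetical, untrained parameters.
params = np.array([[1.0, 0.0, 0.3], [2.0, 0.1, -0.2]])
xs = np.linspace(-np.pi, np.pi, 5)
outputs = [reupload_expectation(x, params) for x in xs]
```

With one layer the output reduces to $\cos(wx + b)$. Additional layers re-encode the input, and that added depth is what a parameter-matched classical baseline would be compared against.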

§ 3

Autonomous systems & infrastructure

Production-shaped engineering: agent systems that run unattended, and the data and compute plumbing underneath them.

Autonomous multi-agent stock analysis system · LangGraph · vLLM · local inference

LangGraph orchestration · vLLM serving · tool-calling agents

A graph of specialized agents that ingests market and fundamentals data, researches a name across roles, and produces a structured analysis without a human in the loop. Orchestration is LangGraph: explicit state, typed edges, and retries rather than a free-running chat. Inference is served locally through vLLM, which keeps throughput high and the models, prompts, and data on hardware I control instead of a third-party API.
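
The orchestration pattern described above (explicit state, edges chosen from that state, bounded retries) can be sketched in plain Python. This is deliberately not the LangGraph API: `run_graph`, `AnalysisState`, the two nodes, and the simulated failure are hypothetical stand-ins meant only to show the control flow.

```python
from typing import Callable, Dict, Optional, TypedDict

class AnalysisState(TypedDict):
    ticker: str
    data: Optional[dict]
    report: Optional[str]
    attempts: int  # number of failed node calls so far

Node = Callable[[AnalysisState], AnalysisState]
Edge = Callable[[AnalysisState], Optional[str]]  # returns next node name, or None to stop

def run_graph(nodes: Dict[str, Node], edges: Dict[str, Edge], start: str,
              state: AnalysisState, max_retries: int = 2) -> AnalysisState:
    """Walk the graph from `start`, retrying a failing node up to max_retries times."""
    current: Optional[str] = start
    while current is not None:
        for attempt in range(max_retries + 1):
            try:
                state = nodes[current](state)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise
                state = {**state, "attempts": state["attempts"] + 1}
        current = edges[current](state)
    return state

_calls = {"fetch": 0}

def fetch(state: AnalysisState) -> AnalysisState:
    """Hypothetical data node that fails once, then succeeds."""
    _calls["fetch"] += 1
    if _calls["fetch"] == 1:
        raise RuntimeError("transient data-source failure (simulated)")
    return {**state, "data": {"price": 101.5}}

def analyze(state: AnalysisState) -> AnalysisState:
    """Hypothetical analysis node that turns fetched data into a report line."""
    return {**state, "report": f"{state['ticker']}: last price {state['data']['price']}"}

nodes = {"fetch": fetch, "analyze": analyze}
edges = {"fetch": lambda s: "analyze" if s["data"] else None,
         "analyze": lambda s: None}
final = run_graph(nodes, edges, "fetch",
                  {"ticker": "XYZ", "data": None, "report": None, "attempts": 0})
```

Keeping the state as one typed dictionary means every transition can be inspected or logged, which is the main contrast with a free-running chat loop.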

Data pipelines & HPC · Sapelo2 cluster · Slurm · Arch Linux

Sapelo2 · Slurm scheduling · reproducible environments

The pipelines that feed the work above: batched embedding and feature jobs scheduled on the Sapelo2 HPC cluster, reproducible environments so a run reproduces months later, and the unglamorous data wrangling that decides whether a result is real. Daily environment is Arch Linux, configured deliberately rather than inherited.
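
A minimal sketch of how a batched job might split its inputs across a Slurm job array, assuming the array is submitted with zero-based indices (for example `--array=0-3`). `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables Slurm sets for array jobs; the function name `shard_for_task` and the sequence IDs are hypothetical.

```python
import os

def shard_for_task(items, n_tasks: int, task_id: int):
    """Return the contiguous, near-equal slice of `items` owned by one array task."""
    base, extra = divmod(len(items), n_tasks)
    # The first `extra` tasks each take one additional item.
    start = task_id * base + min(task_id, extra)
    stop = start + base + (1 if task_id < extra else 0)
    return items[start:stop]

# Outside Slurm these fall back to a single task that processes everything.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

sequence_ids = [f"seq_{i:03d}" for i in range(10)]  # hypothetical inputs
my_batch = shard_for_task(sequence_ids, n_tasks, task_id)
```

Deterministic sharding means a failed array task can be resubmitted on its own and will pick up exactly the same inputs.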

§ 4

Engineering record

Biochemical engineering, before the move into statistics and ML. Expanded in place; no detail lives on a separate page.

L-Asparaginase strain engineering · 2023 · project lead · Top-3, UGA Quick Pitch

SciPy · DEAP evolutionary solver · reactor mass balance

Optimized E. coli fed-batch enzyme yield with a SciPy/DEAP evolutionary solver run over a reactor mass balance. Led the project; placed top-3 at the UGA Quick Pitch competition.
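
A minimal sketch of an evolutionary search over a fed-batch mass balance. It swaps in SciPy's `differential_evolution` in place of DEAP, uses Monod growth with a constant feed rate, and treats final biomass as a stand-in for enzyme yield. Every kinetic constant, the volume cap, and the penalty weight are hypothetical values chosen for illustration, not the project's model.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import differential_evolution

# Hypothetical parameters (not taken from the project).
MU_MAX, K_S, Y_XS = 0.5, 0.2, 0.5   # 1/h, g/L, g biomass per g substrate
S_FEED = 100.0                      # g/L substrate in the feed
V0, V_MAX, T_END = 1.0, 2.0, 10.0   # L, L, h

def rhs(t, y, feed):
    """Fed-batch mass balance for biomass X (g/L), substrate S (g/L), volume V (L)."""
    X, S, V = y
    S = max(S, 0.0)                  # guard against small negative overshoot
    mu = MU_MAX * S / (K_S + S)      # Monod growth rate
    dilution = feed / V
    return [mu * X - dilution * X,
            -mu * X / Y_XS + dilution * (S_FEED - S),
            feed]

def final_state(feed):
    """Integrate the batch and return (total biomass in g, final volume in L)."""
    sol = solve_ivp(rhs, (0.0, T_END), [1.0, 1.0, V0], args=(feed,), method="LSODA")
    X, _, V = sol.y[:, -1]
    return X * V, V

def objective(params):
    """Negative biomass plus a penalty for exceeding the reactor volume."""
    mass, volume = final_state(params[0])
    return -mass + 1e3 * max(0.0, volume - V_MAX)

result = differential_evolution(objective, bounds=[(0.0, 0.2)],
                                seed=0, maxiter=20, popsize=10)
best_feed = float(result.x[0])
best_mass, best_volume = final_state(best_feed)
```

In this toy setup growth ends up substrate-limited, so the search is pushed toward the highest feed rate the volume cap allows. A more realistic model would add terms such as substrate inhibition or a time-varying feed profile.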

Senior bioprocess lab series · 2023 · process control

Ethanol fermentation, Ziegler-Nichols controller tuning, and distillation across a multi-unit bioprocess sequence.

Biofilter reactor kinetics · 2022 · kinetic modeling

VOC-scrubbing biofiltration at 220,000 gal/min. Hit 95% removal of the rate-limiting compound at 5% below baseline cost.

Metabolic & synthetic biology lab · 2022 · molecular biology

PCR, E. coli transformation, and protein purification across a synthetic-biology workflow.

Junior transport & kinetics lab series · 2022 · transport phenomena

Fourier conduction, diffusion, and PID tuning across the transport-phenomena sequence.

Monoclonal antibody bioprocess · 2021 · plant design

Design of a 100 g/day mAb facility: downstream processing and process economics.

Skateboard reverse-engineering · 2020 · CAD

AutoCAD teardown and dimensional reconstruction of a skateboard truck assembly.
