Statistical and machine-learning systems for protein data, high-dimensional inference, and bare-metal HPC.
M.S. Statistics · B.S. Biochemical Engineering
I build and stress-test statistical and machine-learning systems for data that does not stay still. My thesis defines a recovery threshold for protein language models under distribution shift: the largest geometric displacement at which a frozen embedding model still resolves structure. The same instinct (measure the point of failure rather than the headline accuracy) carries into autonomous multi-agent systems, quantum machine learning, and production data pipelines run on high-performance compute. I work close to the metal: HPC scheduling on Sapelo2, local inference serving, and a daily Arch Linux environment.
Keywords: distribution shift · protein language models · high-dimensional statistics · multi-agent systems · quantum machine learning · HPC (Sapelo2) · data pipelines
M.S. thesis, treated as a diagnostic rather than a benchmark. Given a held-out protein set drawn off-distribution, locate the point at which a frozen embedding model stops resolving structure. The deliverable is a threshold, not a leaderboard position.
Shift is parameterized by $\delta$, the absolute geometric displacement between the in-distribution and out-of-distribution embedding clouds. Recovery is read across $N = 36$ unaggregated runs, which keeps the variance structure intact for the threshold fit rather than washing it out in a mean.
The recovery threshold $\delta^{\star}$ is the largest shift at which mean reconstruction fidelity stays above $\alpha$. Below it the model recovers; beyond it the geometry collapses.
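In symbols, the definition can be sketched as follows. The per-run notation $\mathcal{R}_i(\delta)$ for the fidelity of run $i$ at shift $\delta$ is introduced here for illustration, and "stays above" is read as "at this shift and every smaller one", which is an assumption about the intended meaning:

$$
\bar{\mathcal{R}}(\delta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{R}_i(\delta),
\qquad
\delta^{\star} = \sup\bigl\{\, \delta \ge 0 \;:\; \bar{\mathcal{R}}(\delta') \ge \alpha \ \text{ for all } 0 \le \delta' \le \delta \,\bigr\}.
$$

If $\bar{\mathcal{R}}$ is non-increasing in $\delta$, this reduces to $\delta^{\star} = \sup\{\delta \ge 0 : \bar{\mathcal{R}}(\delta) \ge \alpha\}$.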
Embeddings are drawn from a frozen transformer protein language model (ESM-2 class). The diagnostic measures the pretrained geometry itself, not a fine-tuned downstream head.
Recovery is read across $N=36$ unaggregated runs: fidelity $\mathcal{R}(\delta)$ is estimated per run, and the threshold $\delta^{\star}$ is read as the supremum of shifts at which fidelity clears $\alpha$. Keeping runs unaggregated preserves the run-to-run variance that the threshold estimate depends on, and exposes where the collapse is sharp versus gradual. The shift axis is constructed by displacing the OOD mean along controlled directions, so $\delta$ is a measured geometric quantity rather than a proxy label.
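The threshold read-off can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis code: the function names, the shift grid, and the fidelity values are all hypothetical toy inputs, and the threshold is taken on a discrete grid of shifts.

```python
from statistics import fmean


def recovery_threshold(deltas, runs, alpha):
    """Largest grid shift at which mean fidelity, and that of every smaller shift, is >= alpha.

    deltas: shifts in increasing order.
    runs:   one fidelity curve per run, each aligned with `deltas`.
    Returns None if even the smallest shift fails to clear alpha.
    """
    threshold = None
    for j, delta in enumerate(deltas):
        mean_fidelity = fmean([run[j] for run in runs])
        if mean_fidelity < alpha:
            break  # first failure: stop, later shifts do not count
        threshold = delta
    return threshold


def per_run_thresholds(deltas, runs, alpha):
    """Threshold of each run taken on its own, to expose run-to-run spread."""
    return [recovery_threshold(deltas, [run], alpha) for run in runs]


# Toy inputs (made-up numbers, not thesis results): 36 runs sharing one
# base curve, each offset by a small deterministic amount.
deltas = [0.0, 0.5, 1.0, 1.5, 2.0]
base_curve = [0.95, 0.90, 0.82, 0.60, 0.30]
runs = [[r + (i - 17.5) * 0.002 for r in base_curve] for i in range(36)]
alpha = 0.8

pooled = recovery_threshold(deltas, runs, alpha)
per_run = per_run_thresholds(deltas, runs, alpha)
print(pooled, sorted(set(per_run)))
```

On the toy data the pooled threshold and the per-run thresholds disagree for some runs, which is the kind of spread that averaging first would hide.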
Self-directed research initiative, run independently of any course or lab. The question is narrow and honest: where do parameterized quantum circuits actually buy representational capacity over a classical model of matched parameter count, and where is the apparent advantage an artifact of the encoding? I treat it the same way as the thesis: as a measurement problem rather than a demo.
variational circuits · data re-uploading · classical-shadow readout
The working program: build small variational classifiers, control for encoding so comparisons are fair, and characterize trainability against circuit depth. The output I care about is a clear statement of regime (where the quantum model holds its own and where it does not) rather than a single accuracy number on a curated task.
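To make the encoding concrete, here is a minimal single-qubit data re-uploading circuit simulated in plain Python. It is a sketch under assumptions: the layer layout (an input-dependent $R_y$ followed by a trainable $R_z$), the function names, and the parameter values are illustrative choices, and training and classical-shadow readout are left out.

```python
import cmath
import math


def ry(angle):
    """Rotation about the Y axis as a 2x2 matrix."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def rz(angle):
    """Rotation about the Z axis as a 2x2 matrix."""
    return [[cmath.exp(-0.5j * angle), 0], [0, cmath.exp(0.5j * angle)]]


def apply(gate, state):
    """Multiply a 2x2 gate into a single-qubit state vector."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def reupload_circuit(x, params):
    """Probability of measuring |0> after re-encoding x in every layer.

    params: one (w, b, c) tuple per layer, applied as RZ(c) * RY(w * x + b).
    """
    state = [1 + 0j, 0j]  # start in |0>
    for w, b, c in params:
        state = apply(ry(w * x + b), state)  # input enters again here
        state = apply(rz(c), state)          # trainable phase between uploads
    return abs(state[0]) ** 2


one_layer = [(1.0, 0.0, 0.0)]
two_layers = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
for x in (0.0, math.pi / 2, math.pi):
    print(x, reupload_circuit(x, one_layer), reupload_circuit(x, two_layers))
```

With these particular parameters the one-layer output is $\cos^2(x/2)$ and the two-layer output is $\cos^2(x)$, so adding a layer changes which frequencies in $x$ the model can express. A matched-parameter classical baseline would have to be judged against that same input range for the comparison to be fair.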
Production-shaped engineering: agent systems that run unattended, and the data and compute plumbing underneath them.
LangGraph orchestration · vLLM serving · tool-calling agents
A graph of specialized agents that ingests market and fundamentals data, researches a name across roles, and produces a structured analysis without a human in the loop. Orchestration is LangGraph: explicit state, typed edges, and retries rather than a free-running chat. Inference is served locally through vLLM, which keeps throughput high and the models, prompts, and data on hardware I control instead of a third-party API.
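The orchestration pattern (explicit state, named edges, bounded retries) can be sketched without any framework. The code below is a library-free stand-in and deliberately does not use LangGraph's API; the node names, the `State` fields, and the simulated failure are hypothetical, and in a real system the nodes would call a locally served model rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    """Everything the agents share, passed explicitly from node to node."""
    ticker: str
    data: Optional[dict] = None
    notes: List[str] = field(default_factory=list)
    report: Optional[str] = None
    attempts: Dict[str, int] = field(default_factory=dict)


Node = Callable[[State], State]


def run_graph(nodes: Dict[str, Node], edges: Dict[str, Optional[str]],
              start: str, state: State, max_retries: int = 2) -> State:
    """Walk the graph from `start`, retrying a failing node up to max_retries times."""
    current: Optional[str] = start
    while current is not None:
        for attempt in range(max_retries + 1):
            state.attempts[current] = attempt + 1
            try:
                state = nodes[current](state)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
        current = edges[current]
    return state


_calls = {"fetch": 0}


def fetch_data(state: State) -> State:
    _calls["fetch"] += 1
    if _calls["fetch"] == 1:
        raise RuntimeError("transient data-source failure (simulated)")
    state.data = {"ticker": state.ticker, "price": 100.0}  # placeholder values
    return state


def research(state: State) -> State:
    state.notes.append(f"reviewed fundamentals for {state.ticker}")
    return state


def write_report(state: State) -> State:
    state.report = f"{state.ticker}: " + "; ".join(state.notes)
    return state


nodes = {"fetch_data": fetch_data, "research": research, "write_report": write_report}
edges = {"fetch_data": "research", "research": "write_report", "write_report": None}
final = run_graph(nodes, edges, "fetch_data", State(ticker="EXAMPLE"))
print(final.report, final.attempts)
```

Because the state is a single typed object and the edges are a plain mapping, a failed run can be inspected node by node, which is the practical difference from a free-running chat loop.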
Sapelo2 · Slurm scheduling · reproducible environments
The pipelines that feed the work above: batched embedding and feature jobs scheduled on the Sapelo2 HPC cluster, reproducible environments so that a result can be regenerated months later, and the unglamorous data wrangling that decides whether a result is real. Daily environment is Arch Linux, configured deliberately rather than inherited.
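One common way to batch such jobs is a Slurm job array in which each task selects its own shard of the inputs. The sketch below assumes a zero-based array (for example `--array=0-3`); the function names and file names are hypothetical, while `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that Slurm sets for array jobs.

```python
import os


def shard_for_task(items, task_id, task_count):
    """Round-robin shard: task k takes items k, k + count, k + 2 * count, ..."""
    return items[task_id::task_count]


def current_shard(items, env=os.environ):
    """Pick this task's shard from the Slurm array variables.

    Falls back to a single shard holding everything when run outside Slurm.
    Assumes array indices start at 0.
    """
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(env.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return shard_for_task(items, task_id, task_count)


# Hypothetical input list; in practice this would be read from a manifest file.
sequences = [f"protein_{i:03d}.fasta" for i in range(10)]
print(current_shard(sequences))
```

Round-robin sharding keeps shards within one item of each other in size and needs no coordination between tasks, so a failed array index can be resubmitted on its own.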
Biochemical engineering, before the move into statistics and ML. Expanded in place; no detail lives on a separate page.
SciPy · DEAP evolutionary solver · reactor mass balance
Optimized E. coli fed-batch enzyme yield with a SciPy/DEAP evolutionary solver run over a reactor mass balance. Led the project; placed in the top three at the UGA Quick Pitch competition.
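The shape of that optimization can be sketched in dependency-free Python. This is a stand-in, not the project code: it swaps SciPy and DEAP for an explicit Euler integrator and a hand-rolled (mu + lambda) evolution strategy, optimizes a single constant feed rate, and uses illustrative kinetic parameters (Haldane substrate inhibition, growth-associated product) that are assumptions rather than the project's values.

```python
import random


def final_product(feed_rate, hours=20.0, dt=0.05):
    """Total product (g) at the end of a fed-batch run, by explicit Euler steps."""
    mu_max, k_s, k_i = 0.5, 0.2, 20.0   # 1/h, g/L, g/L (illustrative)
    y_xs, y_px = 0.5, 0.1               # g biomass/g substrate, g product/g biomass
    s_feed, v_max = 200.0, 2.0          # feed concentration g/L, fill limit L
    x, s, p, v = 0.1, 5.0, 0.0, 1.0     # biomass, substrate, product (g/L), volume (L)
    for _ in range(int(round(hours / dt))):
        f = feed_rate if v < v_max else 0.0          # stop feeding at the fill limit
        mu = mu_max * s / (k_s + s + s * s / k_i)    # Haldane growth rate
        dil = f / v                                  # dilution rate
        dx = mu * x - dil * x
        ds = -mu * x / y_xs + dil * (s_feed - s)
        dp = y_px * mu * x - dil * p
        x, s, p, v = x + dt * dx, max(s + dt * ds, 0.0), p + dt * dp, v + dt * f
    return p * v


def evolve(fitness, low, high, rng, pop_size=12, generations=25, sigma=0.1):
    """(mu + lambda) evolution strategy on one bounded variable.

    Returns (best_x, best_fitness, best-so-far fitness per generation).
    """
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    scored = sorted(((fitness(x), x) for x in population), reverse=True)
    history = [scored[0][0]]
    for _ in range(generations):
        parents = scored[: pop_size // 2]            # keep the better half (elitism)
        children = []
        for _fit, x in parents:
            child = x + rng.gauss(0.0, sigma * (high - low))
            children.append(min(max(child, low), high))  # clip to bounds
        scored = sorted(parents + [(fitness(c), c) for c in children], reverse=True)
        history.append(scored[0][0])
    best_f, best_x = scored[0]
    return best_x, best_f, history


rng = random.Random(0)  # fixed seed so the run is repeatable
best_feed, best_yield, history = evolve(final_product, 0.0, 0.1, rng)
print(f"feed rate {best_feed:.4f} L/h -> product {best_yield:.3f} g")
```

Keeping the parents in the pool means the best yield never gets worse from one generation to the next, which makes convergence easy to check by inspecting `history`.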
Ethanol fermentation, Ziegler-Nichols controller tuning, and distillation across a multi-unit bioprocess sequence.
VOC-scrubbing biofiltration at 220,000 gal/min. Hit 95% removal of the rate-limiting compound at 5% below baseline cost.
PCR, E. coli transformation, and protein purification across a synthetic-biology workflow.
Fourier conduction, diffusion, and PID tuning across the transport-phenomena sequence.
Design of a 100 g/day mAb facility: downstream processing and process economics.
AutoCAD teardown and dimensional reconstruction of a skateboard truck assembly.