AI Systems Portfolio

Benchmarks and writeups from working with local LLMs, agentic systems, and self-hosted AI infrastructure. Benchmarks lean technical and come with harness code you can run; writeups explain what I built and what I learned, without assuming you already know the jargon.

Hardware note: unless a writeup says otherwise, every benchmark on this page was run on a 2020 M1 MacBook Pro with 32 GB of unified memory, the only machine I currently have for self-hosted inference. Read the numbers as “what this workload does on this hardware,” not as general backend-vs-backend verdicts.

Longer-form notes on what I built and what I learned, written under the current eval-framework rigor bar.

Building an eval framework after my benchmark flipped

My June inference bench said llama.cpp was faster than oMLX on specific models. Three weeks later a rerun said oMLX was faster by 46 percent. After going in circles trying to explain the flip, I retired my own benchmarking scripts, set up a self-hosted LangFuse instance on my home Kubernetes cluster, rebuilt my benchmarking process around it, and reran the previous bench through it as a sanity check.

Jul 2026

Earlier work from before I formalized measurement into an eval framework. Learning-in-public rather than rigorous eval; kept here for context.

llama.cpp vs oMLX on M1

Two 4-bit inference benches on a 32 GB M1: a 5-arm speculative-decoding comparison on Qwen3-8B (plain decode won every matchup) and a 2-arm oMLX vs llama.cpp sweep across four daily-driver models (oMLX faster on three, a wash on one).

Jul 2026

MLX vs llama.cpp on K8s triage (bench v2)

Follow-up eval comparing llama.cpp and MLX on a focused K8s troubleshooting workload. 13 model+backend combinations, 20 triage scenarios, LLM-as-judge scoring. The results confirmed staying on llama.cpp; granite-4.1-8b-tuned emerged as the best speed-at-quality default.

Jun 2026

Frontier vs local LLMs on structured HTML

22-prompt structured-generation benchmark across 17 frontier and 5 local LLMs on HyperFrames HTML composition. Deterministic grading via lint and render. kimi-k2.5 scores a perfect 22/22 at 36.0s median; best local gemma3-12b at 14/22. The real frontier advantage is consistency in the tail, not raw accuracy on routine work.

Jun 2026

ASR engines on Apple Silicon

Four-engine reproducible benchmark of whisperx, parakeet-mlx, mlx-whisper, and whisper.cpp on Apple Silicon. 8 cohorts across 16 LibriSpeech clips, scored on WER and real-time factor. parakeet-mlx tdt-1.1b wins on the speed-at-quality frontier; mlx-whisper large-v3 disqualified itself on a hallucination loop.
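For readers new to the two metrics named above, here is a minimal sketch of how WER and real-time factor are commonly computed. It is not the benchmark's harness: the function names are hypothetical, and the text normalization (lowercasing and whitespace splitting only) is an assumption that the real scoring may not share.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: normalization is just lowercasing and whitespace splitting.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

Lower is better for both numbers: a WER of 0.0 means a word-perfect transcript, and an RTF of 0.5 means the engine needs half a second per second of audio.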

Jun 2026

Zero-shot voice cloning on Apple Silicon

Replaced a planned Piper fine-tune with zero-shot F5-TTS-MLX for video narration on Apple Silicon. A 12.24-second reference clip and a tuned ODE solver produce usable voice clones at ~19x real-time, setting the pipeline's latency floor and shaping worker-daemon concurrency.

Jun 2026

Where a video pipeline spends its time

End-to-end timing benchmark of the 6-step Windmill video-creation pipeline on Apple Silicon. 16 clips x 3 reps (48 runs). The full pipeline runs at RTF 1.0 versus RTF 0.055 for parakeet-mlx in isolation; the ~18x production tax is ffmpeg pre/post-processing, not Windmill scheduling.
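The ~18x figure follows directly from the two RTF values quoted above. A small sketch of the arithmetic, with a hypothetical helper name that is not taken from the benchmark code:

```python
def production_tax(pipeline_rtf: float, isolated_rtf: float) -> float:
    """How many times slower the full pipeline is than the isolated step."""
    if isolated_rtf <= 0:
        raise ValueError("isolated_rtf must be positive")
    return pipeline_rtf / isolated_rtf


clips, reps = 16, 3
total_runs = clips * reps            # 48 runs
tax = production_tax(1.0, 0.055)     # RTF values as quoted in the summary

print(total_runs, round(tax, 1))     # prints: 48 18.2
```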

Jun 2026

OTA firmware over ESP-NOW

A custom four-message OTA protocol for pushing firmware between ESP32s over ESP-NOW (no WiFi, no BLE). Protocol and partition layout proven end-to-end on small payloads; the full 742 KB push reaches 97% before crashing on memory pressure.

Jun 2026

Two-board ESP32 lab reachable by chat

Bootstrap of a permanent two-board ESP32 lab with an always-on agent reachable via chat. Validated end-to-end with 1 Hz ESP-NOW pingpong (100/100 frames) and RSSI-based distance estimation.
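The summary does not say which model the RSSI-based estimate uses. A common choice is the log-distance path-loss model, sketched here as an assumption rather than as the lab's actual method. The reference RSSI at 1 m and the path-loss exponent are hypothetical defaults that would need calibrating per board and per room.

```python
def estimate_distance_m(
    rssi_dbm: float,
    rssi_at_1m_dbm: float = -40.0,     # hypothetical calibration value
    path_loss_exponent: float = 2.0,   # 2.0 is free space; indoors is often higher
) -> float:
    """Log-distance path-loss model: d = 10 ** ((P_1m - RSSI) / (10 * n))."""
    if path_loss_exponent <= 0:
        raise ValueError("path_loss_exponent must be positive")
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10 * path_loss_exponent))
```

With these defaults, a reading equal to the 1 m reference gives 1 m, and every further 20 dB of loss multiplies the estimate by ten. RSSI is noisy in practice, so averaging several frames before converting is usually worthwhile.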

May 2026

MLX vs llama.cpp on M1 Pro (bench v1)

Side-by-side evaluation of llama.cpp and MLX (via oMLX) for self-hosted inference on a single M1 Pro. Custom blind-scoring harness, 6 matched model pairs, 780 measurement rows. No migration recommended, and a silent audio bug surfaced in the MLX path.

May 2026