This portfolio collects systems I’ve designed and built while working hands-on with large language models, agentic workflows, and the infrastructure that makes them production-ready. Each case study below covers the architecture, the decisions I made along the way, and the tradeoffs I weighed.
Scan the cards for a quick look, or expand any case study below for the full writeup.
Bootstrap of a permanent two-board ESP32 lab with an always-on agent reachable via chat. Validated end-to-end with 1 Hz ESP-NOW pingpong (100/100 frames) and RSSI-based distance estimation.
A custom four-message OTA protocol for pushing firmware between ESP32s over ESP-NOW (no WiFi, no BLE). Protocol and partition layout proven end-to-end on small payloads; the full 742 KB push reaches 97% before crashing on memory pressure.
Follow-up eval comparing llama.cpp and MLX on a focused K8s troubleshooting workload. 13 model+backend combinations, 20 triage scenarios, LLM-as-judge scoring. The stay-on-llama.cpp verdict held; granite-4.1-8b-tuned emerged as the best speed-at-quality default.
Side-by-side evaluation of llama.cpp and MLX (via oMLX) for self-hosted inference on a single M1 Pro. Custom blind-scoring harness, 6 matched model pairs, 780 measurement rows. Zero migration recommended, and a silent audio bug surfaced in the MLX path.
The real artifact here is the lab, not the demos that prove it works. Two ESP32 DevKitC v1 boards stay wired to the host laptop over CP2102 USB-serial bridges. ESP-IDF 6.1.0 runs inside a Fedora Atomic toolbox container so the immutable host stays clean. A Hermes-based agent runs on the host and is reachable from my phone over chat, so I can ask “what’s the responder showing?” or “rebuild and reflash” without sitting down at the bench. That setup is what I’ll reuse across every ESP32 project from here forward; pingpong and RSSI were the smoke tests that proved every piece is alive.
espnow_pingpong is the obvious starting point: one board sends a small frame over ESP-NOW every second, the other receives it, both log to USB serial. ESP-NOW has a reputation for fragility in its official examples, and the first build confirmed it. Sender and receiver had separately-defined frame structs that were silently misaligned, so every frame failed CRC validation with the unhelpful message “Receive error data.” The fix was to consolidate the struct into a shared header, mark it __attribute__((packed)) to disable compiler padding, and match the layout byte-for-byte. The second build ran 100 frames with zero errors.
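The padding mismatch is easy to reproduce on the host. The sketch below uses Python's `struct` module with a hypothetical frame layout (a 1-byte type, a 4-byte sequence number, a 2-byte CRC; not the project's actual struct) to show how a padded layout and a packed layout disagree on size and field offsets. This is the same disagreement that `__attribute__((packed))` removes on the C side.

```python
import struct

# Hypothetical frame layout, for illustration only:
#   uint8 msg_type, uint32 seq, uint16 crc
NATIVE = "@BIH"  # native alignment: padding is inserted before the uint32
PACKED = "<BIH"  # little-endian, no padding: the packed layout

native_size = struct.calcsize(NATIVE)
packed_size = struct.calcsize(PACKED)

# The "sender" builds the frame with padding; the "receiver" reads the
# leading bytes assuming the packed layout, so every field after the
# first lands at the wrong offset.
sent = struct.pack(NATIVE, 1, 42, 0xBEEF)
msg_type, seq, crc = struct.unpack(PACKED, sent[:packed_size])

print(f"native={native_size} bytes, packed={packed_size} bytes")
print(f"sent seq=42, receiver decoded seq={seq}")
```

The first field survives because it sits at offset 0 in both layouts, which is part of why this class of bug can look like a checksum problem rather than a layout problem.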
espnow_rssi reuses the same two boards but adds a small Python script on the host that reads both serial streams and computes a real-time distance estimate using the standard log-distance propagation model: distance = 10^((A - rssi) / (10 * n)). The first version had a typo in the destination MAC: both boards were unicasting to themselves rather than to each other, so they were happily “transmitting” while not actually communicating. Switching to the broadcast MAC fixed it in one line.
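As a rough sketch, the log-distance formula translates directly into a small function. It assumes A is the RSSI measured at a 1 m reference distance, as in the usual form of the model; the default value of A here is a placeholder rather than a calibrated number from this lab, and the function name is illustrative.

```python
def estimate_distance_m(rssi_dbm: float, a_dbm: float = -40.0, n: float = 1.6) -> float:
    """Log-distance model: distance = 10^((A - rssi) / (10 * n)).

    a_dbm is the assumed RSSI at 1 m (placeholder default);
    n is the path-loss exponent.
    """
    if n <= 0:
        raise ValueError("path-loss exponent n must be positive")
    return 10 ** ((a_dbm - rssi_dbm) / (10 * n))


if __name__ == "__main__":
    for rssi in (-40, -50, -60, -70):
        print(f"rssi={rssi} dBm -> ~{estimate_distance_m(rssi):.1f} m")
```

Because n sits in the denominator of the exponent, a small n stretches the estimated distance for the same RSSI drop, which is why absolute readings are sensitive to the choice of exponent while relative changes remain usable.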
n = 1.6 as the path-loss exponent, which is below even the free-space value of 2 and well under the 2–4 range typical for indoor environments. Relative changes were fine; absolute readings would need either calibration in the actual environment or a higher n.
The lab-as-persistent-artifact framing is the takeaway worth keeping. Most embedded tutorials assume you bring up the hardware for each project and tear it down at the end. For someone doing many small projects in sequence, that’s the wrong shape: the setup cost gets paid every time and dominates the actual learning. A bench that stays wired up between projects, with an agent that stays reachable over chat, makes “I have 20 minutes, let me check on the test” a viable mode of work instead of an hour-long context switch.
On ESP-NOW specifically: its error messages are uniformly unhelpful, and the only reliable debugging move is logging the raw frame bytes in hex on both ends from the start. Both bugs I hit here (struct misalignment, MAC typo) would have been caught immediately by hex-dump logging, and adding it earlier in the next project saved hours.
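On the host side, comparing the two hex dumps can be as simple as the sketch below. The helper names are illustrative, and it assumes both boards already print their raw frame bytes over serial so the host can collect them.

```python
from typing import Optional


def hexdump(frame: bytes) -> str:
    """Render a frame as space-separated hex bytes, e.g. '01 ab'."""
    return frame.hex(" ")


def first_mismatch(sent: bytes, received: bytes) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical.

    A length difference counts as a mismatch at the end of the shorter frame.
    """
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        return min(len(sent), len(received))
    return None


if __name__ == "__main__":
    sent = bytes([0x01, 0x2A, 0x00, 0x00, 0x00])
    received = bytes([0x01, 0x00, 0x00, 0x00, 0x2A])
    print("sent:    ", hexdump(sent))
    print("received:", hexdump(received))
    print("first mismatch at byte", first_mismatch(sent, received))
```

A mismatch that starts right after the first field points toward layout or padding; a frame that matches byte-for-byte but arrives at the wrong board points toward addressing.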
/home/drain/esp/projects/espnow_pingpong/, /home/drain/esp/projects/espnow_rssi/
esp_https_ota. Designing the protocol from scratch over a non-IP transport forced me to understand every layer (frame format, partition table, bootloader handoff, ack semantics) instead of treating OTA as a black box.
The goal was simple to state and hard to execute: one ESP32 (the initiator) reads a new firmware image from my laptop over USB serial and pushes it over ESP-NOW to a second ESP32 (the responder), which writes it to a staging partition, validates it, and reboots into the new firmware. ESP-NOW is Espressif’s proprietary peer-to-peer radio protocol. It lets two ESP32s talk directly without joining a WiFi network, but it caps payloads at 250 bytes per frame, which means a 742 KB firmware turns into thousands of frames.
I designed a four-message protocol: OTA_BEGIN (carries total size and version string), OTA_CHUNK (sequence number plus up to 200 bytes of payload, leaving room for a small header), OTA_END (CRC32 over the whole image), and OTA_ACK (sent back from responder to initiator after each chunk). The responder uses ESP-IDF’s standard OTA partition layout: a factory slot, two OTA slots (ota_0 and ota_1), and a small otadata partition that the bootloader reads to decide which slot to boot next.
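The framing can be sketched on the host in a few lines. The message type IDs, the header layout (type, sequence number, payload length), and the function names below are assumptions made for illustration; the writeup only fixes the four message names, the 200-byte chunk payload, and the CRC32 over the whole image.

```python
import struct
import zlib
from typing import List

# Hypothetical type IDs and header layout; not the project's wire format.
OTA_BEGIN, OTA_CHUNK, OTA_END, OTA_ACK = 1, 2, 3, 4
MAX_FRAME = 250       # ESP-NOW per-frame payload cap
CHUNK_PAYLOAD = 200   # payload bytes carried by each OTA_CHUNK
CHUNK_HEADER = struct.Struct("<BIH")  # type, sequence number, payload length


def build_frames(image: bytes, version: str) -> List[bytes]:
    """Split an image into OTA_BEGIN, OTA_CHUNK..., OTA_END frames."""
    frames = [struct.pack("<BI", OTA_BEGIN, len(image)) + version.encode("ascii")]
    for seq, offset in enumerate(range(0, len(image), CHUNK_PAYLOAD)):
        payload = image[offset:offset + CHUNK_PAYLOAD]
        frames.append(CHUNK_HEADER.pack(OTA_CHUNK, seq, len(payload)) + payload)
    frames.append(struct.pack("<BI", OTA_END, zlib.crc32(image) & 0xFFFFFFFF))
    if any(len(f) > MAX_FRAME for f in frames):
        raise ValueError("frame exceeds the ESP-NOW payload cap")
    return frames


def build_ack(seq: int) -> bytes:
    """Responder-side acknowledgement for one chunk."""
    return struct.pack("<BI", OTA_ACK, seq)


if __name__ == "__main__":
    image = bytes(range(256)) * 2  # 512-byte stand-in for a firmware image
    frames = build_frames(image, "v0.0.1")
    print(f"{len(frames)} frames, largest is {max(map(len, frames))} bytes")
```

With a 7-byte header, each chunk frame tops out at 207 bytes, which leaves headroom under the 250-byte cap for future header fields.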
Most of the three days went to build-system and partition-table work, not protocol logic. ESP-IDF defaults to a 2 MB flash size and a single-app partition table; getting it to a 4 MB layout with two OTA slots required setting CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y first, before any other partition-table edits, because idf.py set-target esp32 silently resets partition-related config values. I also hit a build bug in components/esp_https_ota/CMakeLists.txt (missing bootloader_support in its REQUIRES list, fixed with a one-line patch to the IDF) and a UART driver conflict where my initiator’s uart_read_bytes() calls fought the console driver for UART0. Fixed by switching to fgetc(stdin), which goes through the VFS layer and shares the UART cleanly.
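The flash-size dependency can be checked with simple arithmetic. The sketch below uses an assumed layout modeled on ESP-IDF's stock "Factory app, two OTA definitions" table; the project's actual offsets and sizes are not given in this writeup, so treat the numbers and the helper names as illustrative.

```python
FLASH_2MB = 2 * 1024 * 1024
FLASH_4MB = 4 * 1024 * 1024

# (name, offset, size): assumed layout, not the project's actual table.
PARTITIONS = [
    ("nvs",      0x9000,   0x4000),
    ("otadata",  0xD000,   0x2000),
    ("phy_init", 0xF000,   0x1000),
    ("factory",  0x10000,  0x100000),
    ("ota_0",    0x110000, 0x100000),
    ("ota_1",    0x210000, 0x100000),
]


def layout_end(partitions):
    """Highest flash address used by any partition."""
    return max(offset + size for _, offset, size in partitions)


def check_layout(partitions, flash_size):
    """Return a list of problems: overlaps, or partitions past end of flash."""
    problems = []
    ordered = sorted(partitions, key=lambda p: p[1])
    for (name_a, off_a, size_a), (name_b, off_b, _) in zip(ordered, ordered[1:]):
        if off_a + size_a > off_b:
            problems.append(f"{name_a} overlaps {name_b}")
    for name, offset, size in ordered:
        if offset + size > flash_size:
            problems.append(f"{name} ends past flash size")
    return problems


if __name__ == "__main__":
    print(f"layout ends at {layout_end(PARTITIONS):#x}")
    print("2 MB:", check_layout(PARTITIONS, FLASH_2MB) or "ok")
    print("4 MB:", check_layout(PARTITIONS, FLASH_4MB) or "ok")
```

Under these assumed sizes, a factory slot plus two 1 MB OTA slots runs past the 2 MB default, which is consistent with the flash size needing to be raised before the partition table is changed.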
OTA_BEGIN and OTA_ACK round-trip cleanly. The staging partition is found, erased (~3 seconds), and written to. Payloads of 1 byte, 100 bytes, and 1 KB stage with 100% reliability across repeated runs.
OTA_END arrives.
esp_partition_write(), and frees it. After roughly 700 KB of these cycles, a subsequent allocation likely fails. I did not instrument with heap_caps_get_free_size() before stopping, so this is consistent with the symptoms but not proven. Probable fixes: move the staging loop to a dedicated FreeRTOS task with explicit yields, or switch from raw esp_partition_write() to the buffered esp_ota_write() path that Espressif uses in their own OTA examples.
The most useful habit I picked up was logging the input, the output, and the byte count at every protocol boundary from the start, not as a debugging step after something breaks. ESP-NOW’s error messages are uniformly unhelpful (“Receive error data” can mean CRC mismatch, length mismatch, or struct-layout mismatch with no way to tell), and the only reliable diagnostic is comparing hex dumps on both ends. Adding that logging once cost an hour; not having it cost most of day one.
The bigger lesson was procedural. I went straight to a 742 KB payload because that’s the “real” test, and the resulting crash was hard to diagnose because the protocol logic, the partition setup, and the memory behavior were all unproven at the same time. A 1 KB smoke test first would have isolated the memory problem cleanly. Next OTA project: validate the smallest end-to-end path before scaling up the payload.
/home/drain/esp/projects/espnow_ota/
qwen3-14b-tuned as the LLM-as-judge. Composite quality is format-gated and built from deterministic verifier passes, action safety, and judge rubric scores.
granite-4.1-8b-tuned at 0.79 composite quality with a 12 second median wall time.
The first eval established that MLX did not justify migration on general workloads. The open question was whether the verdict held on the specific shape of work the cluster runs most often: short structured Kubernetes triage prompts (0.5 to 8 KB snapshot dumps, structured answers, deterministic verification possible). Bench-v2 was built to answer that.
The setup ran 13 model and backend combinations across 20 hand-written K8s triage scenarios (single rep per cell, 260 rows). The scenarios target a known fault in a vcluster sandbox (bench-v2-sandbox on K8s 1.35.0), and each model is asked to identify the fault and propose a safe action. Scoring is format-gated: format compliance must be strict before the composite is computed; the composite then averages deterministic pass, action safety, and three judge-derived scores. The judge model is qwen3-14b-tuned running on llama-server.
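A minimal sketch of the format-gated composite, under three assumptions that the writeup does not spell out: every component is on a 0 to 1 scale, the five components are weighted equally, and a response that fails the format gate scores 0.0. The function and parameter names are illustrative.

```python
from statistics import mean
from typing import Sequence


def composite_quality(
    format_strict: bool,
    deterministic_pass: float,
    action_safety: float,
    judge_scores: Sequence[float],
) -> float:
    """Format-gated composite: an equal-weight mean of five components.

    Assumptions: all inputs are in [0, 1]; a failed format gate scores 0.0.
    """
    if not format_strict:
        return 0.0  # gate: nothing else counts without strict format compliance
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judge-derived scores")
    return mean([deterministic_pass, action_safety, *judge_scores])


if __name__ == "__main__":
    print(composite_quality(True, 1.0, 1.0, [0.5, 0.5, 0.5]))
    print(composite_quality(False, 1.0, 1.0, [1.0, 1.0, 1.0]))
```

With three of five components coming from the judge, missing judge data pulls the composite down sharply even when the deterministic checks pass.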
Three sweeps ran in sequence. A 13-cohort all-in-one (PID 88350) crashed silently because oMLX died before launch, which surfaced the need for an omlx_alive precheck and PYTHONUNBUFFERED=1. A llama-only resume ran clean. An MLX-only resume used a fresh oMLX process per cohort to dodge the swap-evict OOM that hit on May 31 and again on June 3 during mid-cohort model evictions.
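A precheck along these lines could be as small as a TCP connect test before each cohort. This is a sketch only: the name `omlx_alive` comes from the writeup, but the host, port, and the choice of a plain socket connect (rather than an HTTP health request) are assumptions.

```python
import socket


def omlx_alive(host: str = "127.0.0.1", port: int = 8000, timeout_s: float = 2.0) -> bool:
    """Return True if something accepts TCP connections at host:port.

    The default port is a placeholder; use whatever the server listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if not omlx_alive():
        # flush=True gives the same visibility that PYTHONUNBUFFERED=1 provides
        print("backend is not reachable; aborting sweep", flush=True)
        raise SystemExit(1)
    print("backend reachable; starting cohort", flush=True)
```

A connect test only proves the process is listening, not that a model is loaded, so a stricter check would issue a real request.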
granite-4.1-8b-tuned landed at 0.79 composite quality (tied for second overall) with a 12 second median wall time, the best speed-at-quality tradeoff in the lineup. It was added late as a v2 addition; in hindsight it should have been a Phase 0 baseline.
judge_unavailable) for both backends, suppressing composite quality by about 0.4 points purely from missing judge data. Deterministic pass is 90 percent on both, the best of any model. Granite-30b may actually be the highest-quality model in the lineup. Filed as TODO.
unparseable format compliance, vs 95 percent strict under MLX with the same weights. Almost certainly a sampling, chat-template, or EOS-token mismatch in the llama-server invocation. The gemma4-e4b row in the per-family verdict is not a backend comparison until that’s fixed.
cf_01_missing_cm (0%), cc_03_image_pull_secret_missing (0%), rbac_02_wrong_verb (23%). When every model produces well-formed answers that the verifier rejects, the truth definition is more likely wrong than the models. Likely verifier bugs in verifier.py.
The mental model shift from this round was operational, not technical. MLX’s theoretical prefill and TTFT edge did not show up as throughput wins on this prompt shape (short structured prompts, short structured answers). At matched 4-bit quant, quality is genuinely indistinguishable across 5 of 6 families within ±0.10 composite. The “Q4_K_M vs MLX -q4(gs=64) is a polite fiction” caveat from the plan held: the differences are real but small.
The operational fragility finding is the one I would not have predicted. Both crashes came from swap-evict races during model rotations, which is the path a real serving setup exercises but a throughput-only post never does. The granite-8b find was a happy accident: added late as a v2 baseline, it won the speed-at-quality leaderboard. The lesson worth carrying is to include smaller, less-hyped models as baselines; they sometimes lap the field.
~/Projects/01-homelab/mbp-local-inference/bench-v2/decision.md
~/.claude/plans/i-currently-use-litellm-frolicking-wave.md
The question was simple: should I move my self-hosted LLM stack (coding daily-driver, Obsidian vault chat, long-context planner, and Gemma 4 multimodal experiments) from llama.cpp to Apple’s MLX framework? MLX benchmark posts had been claiming significant wins on Apple Silicon, and the current llama.cpp setup behind a LiteLLM proxy works fine but isn’t the point of comparison those benchmarks use. I wanted to know whether the gains were real for the workloads I actually run.
The eval rig ran side-by-side with the production stack, never interrupting cluster traffic. Both backends sat behind the same LiteLLM proxy, so the harness measured the end-to-end path the cluster uses, not raw server throughput. Per-model llama-server instances were started and stopped on demand by a small shell manager because 32 GB of RAM wouldn’t hold them all resident; oMLX served all MLX models from a single process with LRU eviction. A small Python harness wrote one CSV row per streamed request, capturing TTFT, decode tok/s, peak RSS, wall time, and both backend versions.
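The per-request timing can be derived from arrival timestamps alone. The sketch below assumes each streamed chunk counts as one token and that decode tok/s is measured from the first token to the last; both are assumptions about the harness, and the field names are illustrative.

```python
import csv
import io
from typing import Dict, List


def stream_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, decode tok/s, and wall time from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    wall = token_times[-1] - request_start
    decode_window = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent producing them.
    decode_tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "decode_tok_s": decode_tps, "wall_s": wall}


def to_csv_row(backend: str, model: str, metrics: Dict[str, float]) -> str:
    """Format one measurement as a single CSV line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="")
    writer.writerow([backend, model, metrics["ttft_s"],
                     metrics["decode_tok_s"], metrics["wall_s"]])
    return buf.getvalue()


if __name__ == "__main__":
    m = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
    print(to_csv_row("llama.cpp", "example-model", m))
```

Separating TTFT from decode rate matters because a backend can win one and lose the other, and the wall time the user feels is a mix of both.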
Three methodology calls mattered. First, four weighted metrics with quality at 0.40: a faster wrong answer is worse than a slower right one, and the weight explicitly prevents “MLX wins on tok/s” from auto-meaning “migrate.” Second, blind quality scoring: the scoring CLI hides which backend produced each completion until after the rubric score is submitted, preventing me from unconsciously favoring whichever backend I wanted to win. Third, matched-pair sweeps with per-cohort orchestration so the RAM ceiling never got violated; one non-production server resident at a time. Multimodal evaluation went through a separate runner using OpenAI-format input_audio and image_url content types, scored with jiwer-WER against ground-truth transcripts for audio and the same blind rubric for text and image.
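For readers unfamiliar with the audio metric, word error rate is the word-level edit distance divided by the reference length. The sketch below is a plain-Python stand-in for illustration, not jiwer itself; lowercasing and whitespace tokenization are assumptions about normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "turn the kitchen lights off"
    print(word_error_rate(truth, "turn the kitchen light off"))
    print(word_error_rate(truth, "thanks for watching this video today"))
```

Scoring against ground truth is what makes a fluent but unrelated transcript stand out: it reads fine on its own and still scores a WER near or above 1.0.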
mlx-audio, returns the same canned hallucination for every audio clip: a fixed English sentence regardless of input. Diagnosis is a mel-filter configuration mismatch in oMLX’s preprocessing producing degenerate features, so the model falls back to a generic completion. Failure is silent: HTTP 200, plausible text, no diagnostic to the client. llama.cpp transcribes the same clips at WER ~0.15 (~85% accuracy). A reproduction + diagnostic was drafted as a GitHub issue against the oMLX repo.
wired_count, not the process RSS field psutil samples, so the 0.15 RSS weight effectively favored MLX by an unknown margin. A proper workaround would sample vm_stat wired-page deltas; logged as follow-up. Sample size is also small (1 to 7 quality scores per text combo, 15 reps per audio combo), directional rather than statistically significant.
-q4 (gs=64) have different bit-budgets and error distributions. I judged each backend at its “recommended 4-bit” rather than normalizing the difference away.
The framework that made the eval credible was the 40% quality weight plus blind scoring. Without either piece, the verdicts would have drifted. With both, the speed gaps that looked decisive on paper (10×, 19× TTFT) still didn’t justify a migration because the quality side didn’t separate the backends enough to matter. The plan agent predicted “partial migration, multimodal only” before any data came in; the data inverted that. I’d reach for the same framework for the next backend decision and trust the data over the prior.
The audio bug is the meta-lesson about benchmark-only evaluation. Throughput posts wouldn’t have caught it. The MLX path returns plausible text fast, and only running real audio through the production code path with WER scoring against ground truth made the failure visible. The scope of the bench mattered more than the metrics in the bench.
One scope caveat worth being explicit about: this eval was on an M1 Pro with 32 GB. MLX’s gains scale with memory bandwidth, and a higher-bandwidth Apple Silicon part (M3 Max, M4 Max, M-Ultra) could plausibly flip multiple verdicts. The conclusion is “stay on llama.cpp for this hardware,” not “MLX is worse.”