AI Systems Portfolio

This portfolio collects systems I’ve designed and built while working hands-on with large language models, agentic workflows, and the infrastructure that makes them production-ready. Each case study below covers the architecture, the decisions I made along the way, and the tradeoffs I weighed.

Scan the cards for a quick look, or expand any case study below for the full writeup.

Two-Board ESP32 Lab: ESP-NOW Foundations
Late May 2026, Working

Bootstrap of a permanent two-board ESP32 lab with an always-on agent reachable via chat. Validated end-to-end with 1 Hz ESP-NOW pingpong (100/100 frames) and RSSI-based distance estimation.

embedded esp32 esp-now lab-setup agentic

Custom OTA over ESP-NOW
Late May to Early June 2026, Partial (97%)

A custom four-message OTA protocol for pushing firmware between ESP32s over ESP-NOW (no WiFi, no BLE). Protocol and partition layout proven end-to-end on small payloads; the full 742 KB push reaches 97% before crashing on memory pressure.

embedded esp32 esp-now ota protocol-design

LLM Backend Eval Round 2: K8s Triage Scenarios
June 2026, Complete

Follow-up eval comparing llama.cpp and MLX on a focused K8s troubleshooting workload. 13 model+backend combinations, 20 triage scenarios, LLM-as-judge scoring. Stay on llama.cpp confirmed; granite-4.1-8b-tuned emerged as the best speed-at-quality default.

evaluation self-hosted apple-silicon llm-ops kubernetes

Local LLM Inference Backend Eval: MLX vs llama.cpp on M1 Pro
May 2026, Complete

Side-by-side evaluation of llama.cpp and MLX (via oMLX) for self-hosted inference on a single M1 Pro. Custom blind-scoring harness, 6 matched model pairs, 780 measurement rows. No migration recommended, and a silent audio bug surfaced in the MLX path.

evaluation self-hosted apple-silicon llm-ops benchmarking

Case Studies

Two-Board ESP32 Lab: ESP-NOW Foundations

Late May 2026, Working
ESP32 DevKitC v1, ESP-IDF 6.1.0, Fedora Atomic toolbox, ESP-NOW, CP2102 USB-serial, Hermes Agent

TL;DR

  • What: Stood up a permanent two-board ESP32 lab (boards wired to the host laptop, ESP-IDF in a container, and an always-on agent reachable from my phone), then validated the radio link with 1 Hz ESP-NOW pingpong and RSSI-based distance estimation.
  • Outcome: Lab is live and the workflow works. Pingpong ran 100 frames at 100% reliability; RSSI tracked relative distance changes correctly (~6 dB drop when I moved a board from ~1 m to ~2 m), though absolute distance numbers were off.
  • Why interesting: The real artifact is the lab, not the protocols. Pingpong is just “hello world.” The point is having a bench that stays wired up between projects and an agent I can message (“how’s the test going?”) instead of having to sit down at the laptop to find out.
  • Lesson: Treating the bench as a persistent product, not a per-project setup, is the speed multiplier. The same is true for the agent: when it stays running and reachable over chat, the iteration loop stops being “open laptop, attach to serial, check logs” and becomes “send a message.”

What I built

The real artifact here is the lab, not the demos that prove it works. Two ESP32 DevKitC v1 boards stay wired to the host laptop over CP2102 USB-serial bridges. ESP-IDF 6.1.0 runs inside a Fedora Atomic toolbox container so the immutable host stays clean. A Hermes-based agent runs on the host and is reachable from my phone over chat, so I can ask “what’s the responder showing?” or “rebuild and reflash” without sitting down at the bench. That setup is what I’ll reuse across every ESP32 project from here forward; pingpong and RSSI were the smoke tests that proved every piece is alive.

espnow_pingpong is the obvious starting point: one board sends a small frame over ESP-NOW every second, the other receives it, and both log to USB serial. ESP-NOW has a reputation for fragility in its official examples, and the first build confirmed it. Sender and receiver had separately defined frame structs that were silently misaligned, so every frame failed CRC validation with the unhelpful message “Receive error data.” The fix was to consolidate the struct into a shared header, mark it __attribute__((packed)) to disable compiler padding, and match the layout byte-for-byte. The second build ran 100 frames with zero errors.
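The padding problem can be reproduced on the host with Python's struct module. This is a minimal sketch, not the project's frame definition: the three fields (a 1-byte type, a 4-byte sequence number, a 2-byte CRC) are hypothetical. It compares a natively aligned layout, where the compiler-style padding appears, with a padding-free layout that plays the role of the packed struct.

```python
import struct

# Hypothetical frame fields: 1-byte type, 4-byte sequence number, 2-byte CRC.
FIELDS = "BIH"

# "@" follows the host's native C alignment rules, so padding bytes appear.
NATIVE = struct.Struct("@" + FIELDS)
# "<" is little-endian with no padding, the analogue of a packed struct.
PACKED = struct.Struct("<" + FIELDS)


def build_frame(msg_type: int, seq: int, crc: int) -> bytes:
    """Serialize one frame using the padding-free layout."""
    return PACKED.pack(msg_type, seq, crc)


if __name__ == "__main__":
    print("native size:", NATIVE.size)  # includes alignment padding
    print("packed size:", PACKED.size)  # 1 + 4 + 2 bytes
    print(build_frame(1, 42, 0xBEEF).hex(" "))
```

If the two sizes differ, a sender and receiver that disagree on packing will read every field after the first from the wrong offset, which is the kind of silent mismatch described above.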

espnow_rssi reuses the same two boards but adds a small Python script on the host that reads both serial streams and computes a real-time distance estimate using the standard log-distance propagation model: distance = 10^((A - rssi) / (10 * n)), where A is the RSSI at a 1 m reference distance and n is the path-loss exponent. The first version had a typo in the destination MAC: both boards were unicasting to themselves rather than to each other, so they were happily “transmitting” while not actually communicating. Switching to the broadcast MAC fixed it in one line.
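The formula translates directly into a small function. This is a sketch of the model only, not the host script itself; the function name and the -40 dBm reference value are illustrative assumptions.

```python
def estimate_distance_m(rssi_dbm: float, a_dbm: float, n: float) -> float:
    """Log-distance model: distance = 10^((A - rssi) / (10 * n)).

    a_dbm is the RSSI expected at the 1 m reference distance and n is the
    path-loss exponent; both need to be chosen or calibrated per environment.
    """
    if n <= 0:
        raise ValueError("path-loss exponent must be positive")
    return 10 ** ((a_dbm - rssi_dbm) / (10 * n))


if __name__ == "__main__":
    A_1M = -40.0  # hypothetical calibration value, not from the project
    for rssi in (-40.0, -46.0, -52.0):
        print(rssi, round(estimate_distance_m(rssi, A_1M, n=2.0), 2))
```

Because n sits in the denominator of the exponent, a small error in n compounds quickly at longer ranges, which is why relative changes can look right while absolute readings drift.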

What worked, what didn’t

  • Worked: Pingpong 100/100 frames after the struct fix. RSSI ranging tracked relative motion correctly: moving a board from ~1 m to ~2 m produced a ~6 dB RSSI drop, which is what the log-distance model predicts for a doubling of distance when n is close to 2. The persistent lab + chat-accessible agent worked exactly as I hoped: I checked in on RSSI runs from another room without touching the laptop.
  • Didn’t work: Absolute distances from RSSI were off. I’d picked n = 1.6 as the path-loss exponent, which is below even the free-space value of 2 and well under the 2–4 range typical for indoor environments. Relative changes were fine; absolute readings would need either calibration in the actual environment or a higher n.
  • Hardware quirk: One of the two boards’ CP2102 USB-serial bridges went flaky after the first session. TX went silent after a successful flash, returned after a power cycle, and eventually wouldn’t recover. Swapping in the spare board fixed it permanently. The corollary is that a two-board lab needs a third board on hand; without one you can’t distinguish “my code is broken” from “my hardware is broken.”
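The RSSI numbers above can be cross-checked against the model with a few lines of arithmetic. This sketch treats the reported figures as exact (about 6 dB of drop between about 1 m and about 2 m), which they are not, so the results are indicative only. The function names are illustrative.

```python
import math


def implied_exponent(drop_db: float, d1_m: float, d2_m: float) -> float:
    """Solve drop = 10 * n * log10(d2 / d1) for the path-loss exponent n."""
    return drop_db / (10 * math.log10(d2_m / d1_m))


def predicted_drop_db(n: float, d1_m: float, d2_m: float) -> float:
    """RSSI drop the log-distance model predicts between two distances."""
    return 10 * n * math.log10(d2_m / d1_m)


if __name__ == "__main__":
    print(round(implied_exponent(6.0, 1.0, 2.0), 2))   # exponent implied by the bench
    print(round(predicted_drop_db(1.6, 1.0, 2.0), 1))  # drop expected at n = 1.6
```

Under those assumptions the measured drop implies n of roughly 2.0, while n = 1.6 would predict only about 4.8 dB for the same move. Feeding a 6 dB drop into the model at n = 1.6 yields roughly 2.4 m instead of 2 m, which matches the direction of the absolute-distance error.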

What I learned

The lab-as-persistent-artifact framing is the takeaway worth keeping. Most embedded tutorials assume you bring up the hardware for each project and tear it down at the end. For someone doing many small projects in sequence, that’s the wrong shape: the setup cost gets paid every time and dominates the actual learning. A bench that stays wired up between projects, with an agent that stays reachable over chat, makes “I have 20 minutes, let me check on the test” a viable mode of work instead of an hour-long context switch.

On ESP-NOW specifically: its error messages are uniformly unhelpful, and the only reliable debugging move is logging the raw frame bytes in hex on both ends from the start. Both bugs I hit here (struct misalignment, MAC typo) would have been caught immediately by hex-dump logging, and adding it earlier in the next project saved hours.
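A host-side helper along these lines makes the hex-dump comparison mechanical. It is a sketch that assumes the raw bytes from both boards have already been captured (for example, parsed out of the serial logs); the function names are hypothetical.

```python
from typing import Optional


def describe(label: str, frame: bytes) -> str:
    """One log line: label, byte count, and space-separated hex."""
    return f"{label} len={len(frame)} {frame.hex(' ')}"


def first_mismatch(sent: bytes, received: bytes) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        # One frame is a prefix of the other; they diverge where it ends.
        return min(len(sent), len(received))
    return None


if __name__ == "__main__":
    tx = bytes([0x01, 0x2A, 0x00, 0x00, 0x00])
    rx = bytes([0x01, 0x00, 0x00, 0x00, 0x2A])
    print(describe("tx", tx))
    print(describe("rx", rx))
    print("first mismatch at byte", first_mismatch(tx, rx))
```

A mismatch index that lands right after the first field is a strong hint of a layout problem, while identical bytes with a failing check point elsewhere.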

  • Code: /home/drain/esp/projects/espnow_pingpong/, /home/drain/esp/projects/espnow_rssi/
  • Related entry: Custom OTA over ESP-NOW, the next project that ran on this lab and pushed it harder

Custom OTA over ESP-NOW

Late May to Early June 2026, Partial (97%)
ESP32 DevKitC v1, ESP-IDF 6.1.0, Fedora Atomic toolbox, ESP-NOW, Custom OTA protocol, Hermes Agent

TL;DR

  • What: A custom OTA firmware-update protocol that pushes a new firmware image from one ESP32 to another over ESP-NOW (no WiFi, no BLE), with just a radio link and a USB cable feeding the initiator.
  • Outcome: Protocol, partition layout, and end-to-end flow all proven on small payloads (1 byte, 100 bytes, 1 KB) at 100% reliability. The full 742 KB firmware push reaches 97% and then crashes on a memory issue, not a protocol issue.
  • Why interesting: Most OTA tutorials use WiFi and Espressif’s prebuilt esp_https_ota. Designing the protocol from scratch over a non-IP transport forced me to understand every layer (frame format, partition table, bootloader handoff, ack semantics) instead of treating OTA as a black box.
  • Lesson: Knowing when to stop is a real engineering skill. The remaining 3% is a memory-management optimization, not a design problem, and I made a deliberate call to document the diagnosis and move on rather than spend another week chasing it.

What I built

The goal was simple to state and hard to execute: one ESP32 (the initiator) reads a new firmware image from my laptop over USB serial and pushes it over ESP-NOW to a second ESP32 (the responder), which writes it to a staging partition, validates it, and reboots into the new firmware. ESP-NOW is Espressif’s proprietary peer-to-peer radio protocol. It lets two ESP32s talk directly without joining a WiFi network, but it caps payloads at 250 bytes per frame, which means a 742 KB firmware turns into thousands of frames.

I designed a four-message protocol: OTA_BEGIN (carries total size and version string), OTA_CHUNK (sequence number plus up to 200 bytes of payload, leaving room for a small header), OTA_END (CRC32 over the whole image), and OTA_ACK (sent back from responder to initiator after each chunk). The responder uses ESP-IDF’s standard OTA partition layout: a factory slot, two OTA slots (ota_0 and ota_1), and a small otadata partition that the bootloader reads to decide which slot to boot next.
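The four messages can be modeled on the host to check framing and CRC logic before any firmware is involved. This is a sketch under stated assumptions: the message type IDs and header layouts below are hypothetical, not the project's wire format. Only the 250-byte frame cap, the 200-byte chunk payload, the CRC32 in OTA_END, and the per-chunk OTA_ACK come from the description above.

```python
import struct
import zlib

ESPNOW_MAX_FRAME = 250  # per-frame payload cap described above
CHUNK_PAYLOAD = 200     # per-chunk data size described above

# Hypothetical message type IDs and header layouts.
OTA_BEGIN, OTA_CHUNK, OTA_END, OTA_ACK = 1, 2, 3, 4
BEGIN_HDR = struct.Struct("<BI16s")  # type, total size, version string
CHUNK_HDR = struct.Struct("<BIH")    # type, sequence number, payload length
END_HDR = struct.Struct("<BI")       # type, CRC32 over the whole image
ACK_HDR = struct.Struct("<BI")       # type, acknowledged sequence number


def encode_image(image: bytes, version: str):
    """Yield the initiator's frames in order: BEGIN, CHUNK..., END."""
    yield BEGIN_HDR.pack(OTA_BEGIN, len(image), version.encode()[:16])
    for seq, off in enumerate(range(0, len(image), CHUNK_PAYLOAD)):
        data = image[off:off + CHUNK_PAYLOAD]
        yield CHUNK_HDR.pack(OTA_CHUNK, seq, len(data)) + data
    yield END_HDR.pack(OTA_END, zlib.crc32(image))


def decode_frames(frames):
    """Responder side: reassemble, verify CRC32, return (image, acks)."""
    image, acks, expected_crc = bytearray(), [], None
    for frame in frames:
        kind = frame[0]
        if kind == OTA_CHUNK:
            _, seq, length = CHUNK_HDR.unpack_from(frame)
            image += frame[CHUNK_HDR.size:CHUNK_HDR.size + length]
            acks.append(ACK_HDR.pack(OTA_ACK, seq))  # one ack per chunk
        elif kind == OTA_END:
            _, expected_crc = END_HDR.unpack(frame)
    if zlib.crc32(image) != expected_crc:
        raise ValueError("CRC32 mismatch")
    return bytes(image), acks
```

With 200-byte chunks, a 742 KB image works out to roughly 3,700 to 3,800 OTA_CHUNK frames, depending on whether KB means 1,000 or 1,024 bytes.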

Most of the three days went to build-system and partition-table work, not protocol logic. ESP-IDF defaults to a 2 MB flash size and a single-app partition table; getting it to a 4 MB layout with two OTA slots required setting CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y first, before any other partition-table edits, because idf.py set-target esp32 silently resets partition-related config values. I also hit a build bug in components/esp_https_ota/CMakeLists.txt (missing bootloader_support in its REQUIRES list, fixed with a one-line patch to the IDF) and a UART driver conflict where my initiator’s uart_read_bytes() calls fought the console driver for UART0. I fixed the conflict by switching to fgetc(stdin), which goes through the VFS layer and shares the UART cleanly.
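A quick host-side check shows why the flash size has to change before the two-slot table is viable. The layout below is a hypothetical one with a factory slot and two 1 MB OTA slots; the project's actual offsets and sizes are not given in this writeup.

```python
# Hypothetical layout: (name, offset, size) in bytes.
LAYOUT = [
    ("nvs",      0x009000, 0x004000),
    ("otadata",  0x00D000, 0x002000),
    ("phy_init", 0x00F000, 0x001000),
    ("factory",  0x010000, 0x100000),
    ("ota_0",    0x110000, 0x100000),
    ("ota_1",    0x210000, 0x100000),
]
FLASH_2MB = 2 * 1024 * 1024
FLASH_4MB = 4 * 1024 * 1024
IMAGE_BYTES = 742 * 1024  # upper bound for the 742 KB image


def fits_in_flash(layout, flash_bytes: int) -> bool:
    """True if partitions do not overlap and all end within the flash."""
    end = 0
    for _name, offset, size in sorted(layout, key=lambda p: p[1]):
        if offset < end:
            return False  # overlaps the previous partition
        end = offset + size
    return end <= flash_bytes


def slot_size(layout, name: str) -> int:
    """Size in bytes of the named partition."""
    for part_name, _offset, size in layout:
        if part_name == name:
            return size
    raise KeyError(name)


if __name__ == "__main__":
    print("fits 2 MB:", fits_in_flash(LAYOUT, FLASH_2MB))
    print("fits 4 MB:", fits_in_flash(LAYOUT, FLASH_4MB))
    print("image fits ota_0:", IMAGE_BYTES <= slot_size(LAYOUT, "ota_0"))
```

Under these assumed sizes the table ends a little above 3 MB, so it cannot fit a 2 MB flash setting but fits comfortably in 4 MB.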

What worked, what didn’t

  • Worked: Both devices boot, discover each other, and complete the protocol handshake. OTA_BEGIN and OTA_ACK round-trip cleanly. The staging partition is found, erased (~3 seconds), and written to. Payloads of 1 byte, 100 bytes, and 1 KB stage with 100% reliability across repeated runs.
  • Didn’t work: The full 742 KB push reaches ~720,896 bytes written (97.1%) and then the responder crashes and reboots before OTA_END arrives.
  • Diagnosis (educated guess, not confirmed): Memory pressure in the staging loop. Each iteration allocates a 4 KB receive buffer, reads into it, writes via esp_partition_write(), and frees it. After roughly 700 KB of these cycles, a subsequent allocation likely fails. I did not instrument with heap_caps_get_free_size() before stopping, so this is consistent with the symptoms but not proven. Probable fixes: move the staging loop to a dedicated FreeRTOS task with explicit yields, or switch from raw esp_partition_write() to the buffered esp_ota_write() path that Espressif uses in their own OTA examples.

What I learned

The most useful habit I picked up was logging the input, the output, and the byte count at every protocol boundary from the start, not as a debugging step after something breaks. ESP-NOW’s error messages are uniformly unhelpful (“Receive error data” can mean CRC mismatch, length mismatch, or struct-layout mismatch with no way to tell), and the only reliable diagnostic is comparing hex dumps on both ends. Adding that logging once cost an hour; not having it cost most of day one.

The bigger lesson was procedural. I went straight to a 742 KB payload because that’s the “real” test, and the resulting crash was hard to diagnose because the protocol logic, the partition setup, and the memory behavior were all unproven at the same time. A 1 KB smoke test first would have isolated the memory problem cleanly. Next OTA project: validate the smallest end-to-end path before scaling up the payload.

  • Code: /home/drain/esp/projects/espnow_ota/
  • Related entry: Two-Board ESP32 Lab: ESP-NOW Foundations, the radio layer this protocol builds on

LLM Backend Eval Round 2: K8s Triage Scenarios

June 2026, Complete
llama.cpp, MLX, oMLX, LiteLLM, qwen3-14b judge, vcluster, Kubernetes

TL;DR

  • What: Round 2 of the local LLM backend eval. 13 model and backend combinations run against 20 hand-written Kubernetes triage scenarios (260 rows total), with qwen3-14b-tuned as the LLM-as-judge. Composite quality is format-gated and built from deterministic verifier passes, action safety, and judge rubric scores.
  • Outcome: Stay on llama.cpp confirmed. llama matches or beats MLX on quality in 5 of 6 fairly comparable model families and is 10 to 64 percent faster on throughput in the same 5. Default cluster model: granite-4.1-8b-tuned at 0.79 composite quality with a 12 second median wall time.
  • Why interesting: The first eval flagged the headline (no migration); this one stress-tested it on a workload that mirrors how the cluster actually uses these models. It also surfaced operational fragility in oMLX (two Metal-OOM crashes from mid-cohort swap-evict races) that doesn’t appear in throughput-only benchmarks.
  • Lesson: Production-shaped scenarios surface failure modes that synthetic benchmarks miss. The two crashes, three measurement bugs, and the granite-8b discovery all came from running the eval through the real cluster path, not from running models in isolation.

What I built

The first eval established that MLX did not justify migration on general workloads. The open question was whether the verdict held on the specific shape of work the cluster runs most often: short structured Kubernetes triage prompts (0.5 to 8 KB snapshot dumps, structured answers, deterministic verification possible). Bench-v2 was built to answer that.

The setup ran 13 model and backend combinations across 20 hand-written K8s triage scenarios (single rep per cell, 260 rows). The scenarios target a known fault in a vcluster sandbox (bench-v2-sandbox on K8s 1.35.0), and each model is asked to identify the fault and propose a safe action. Scoring is format-gated: format compliance must be strict before the composite is computed; the composite then averages deterministic pass, action safety, and three judge-derived scores. The judge model is qwen3-14b-tuned running on llama-server.
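The format gate and the averaging step can be sketched as a single function. Two details are assumptions, since the writeup does not spell them out: rows that fail the gate score 0.0, and the five components are on a 0 to 1 scale with equal weight.

```python
from statistics import mean
from typing import Sequence


def composite_quality(
    format_strict: bool,
    deterministic_pass: float,
    action_safe: float,
    judge_scores: Sequence[float],
) -> float:
    """Format-gated composite over five equally weighted components."""
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judge-derived scores")
    if not format_strict:
        return 0.0  # assumption: a row that fails the gate scores zero
    return mean([deterministic_pass, action_safe, *judge_scores])


if __name__ == "__main__":
    print(composite_quality(True, 1.0, 1.0, [0.5, 0.5, 0.5]))
    print(composite_quality(False, 1.0, 1.0, [1.0, 1.0, 1.0]))
```

Gating before averaging means a model cannot buy back a malformed answer with strong judge scores.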

Three sweeps ran in sequence. A 13-cohort all-in-one (PID 88350) crashed silently because oMLX died before launch, which surfaced the need for an omlx_alive precheck and PYTHONUNBUFFERED=1. A llama-only resume ran clean. An MLX-only resume used a fresh oMLX process per cohort to dodge the swap-evict OOM that hit on May 31 and again on June 3 during mid-cohort model evictions.
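A liveness precheck can be as small as a TCP connect attempt before each cohort. This sketch borrows the omlx_alive name from the text, but the implementation is a guess: it only checks that something accepts connections at a host and port supplied by the caller, and does not use any oMLX-specific endpoint.

```python
import socket


def omlx_alive(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running the sweep with PYTHONUNBUFFERED=1 complements the precheck: if the process still dies, its last log lines reach the file instead of being lost in a buffer.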

What worked, what didn’t

  • Worked: Verdict held across the production-shaped workload. In 5 of 6 fairly comparable families, llama matched or beat MLX on quality and was 10 to 64 percent faster on throughput. Per family: qwen3-14b tied at 0.40 vs 0.39 (llama 10% faster), llama won qwen3-8b on quality at 0.36 vs 0.26 (MLX 11% faster, which did not compensate), mistral-nemo-12b tied at 0.60 vs 0.59 (llama 20% faster), gemma3-12b tied at 0.81 vs 0.79 (llama 51% faster). No configuration where MLX is the right call on this workload.
  • The granite-8b discovery: granite-4.1-8b-tuned landed at 0.79 composite quality (tied for second overall) with a 12 second median wall time, the best speed-at-quality tradeoff in the lineup. It was a late v2 addition; in hindsight it should have been a Phase 0 baseline.
  • Operational fragility in oMLX: Two Metal-OOM crashes during the eval, both caused by swap-evict races during mid-cohort model swaps. The fresh-oMLX-per-cohort discipline works but is overhead that llama-server doesn’t impose. Production use would need either single-model invocations or the per-cohort restart discipline as standing practice.
  • Three measurement bugs that bound the conclusions but don’t change the headline:
    • Granite-4.1-30b: judge integration fails 40 of 40 calls (judge_unavailable) for both backends, suppressing composite quality by about 0.4 points purely from missing judge data. Deterministic pass is 90 percent on both, the best of any model. Granite-30b may actually be the highest-quality model in the lineup. Filed as TODO.
    • Gemma4-e4b on llama-server: 14 of 20 scenarios return unparseable format compliance, vs 95 percent strict under MLX with the same weights. Almost certainly a sampling, chat-template, or EOS-token mismatch in the llama-server invocation. The gemma4-e4b row in the per-family verdict is not a backend comparison until that’s fixed.
    • Three scenarios fail at 0 to 23 percent deterministic pass across all 13 models while remaining 100 percent action-safe: cf_01_missing_cm (0%), cc_03_image_pull_secret_missing (0%), rbac_02_wrong_verb (23%). When every model produces well-formed answers that the verifier rejects, the truth definition is more likely wrong than the models. Likely verifier bugs in verifier.py.
  • Excluded by design: Single rep per cell, deterministic and quality bucket only (no throughput characterization). The plan called for 5 reps in the throughput bucket; that’s separate work. The judge was not held out (qwen3-14b-tuned’s own scores include judge calls on its own outputs), acceptable for a relative comparison.
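The "every model fails but every answer is safe" pattern can be flagged automatically from the result rows. This is a sketch with an assumed row shape of (scenario, model, deterministic_pass, action_safe) and hypothetical default thresholds; it is not taken from the eval's code.

```python
from collections import defaultdict


def suspect_scenarios(rows, max_pass_rate: float = 0.25, min_safe_rate: float = 1.0):
    """Return scenarios that nearly every model fails while staying action-safe.

    rows: iterable of (scenario, model, deterministic_pass, action_safe).
    """
    stats = defaultdict(lambda: [0, 0, 0])  # total, passes, safe
    for scenario, _model, passed, safe in rows:
        entry = stats[scenario]
        entry[0] += 1
        entry[1] += bool(passed)
        entry[2] += bool(safe)
    return sorted(
        name
        for name, (total, passes, safe) in stats.items()
        if passes / total <= max_pass_rate and safe / total >= min_safe_rate
    )
```

A scenario that lands in this list is a candidate for a verifier review before it is counted against any model.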

What I learned

The mental model shift from this round was operational, not technical. MLX’s theoretical prefill and TTFT edge did not show up as throughput wins on this prompt shape (short structured prompts, short structured answers). At matched 4-bit quant, quality is genuinely indistinguishable across 5 of 6 families within ±0.10 composite. The “Q4_K_M vs MLX -q4(gs=64) is a polite fiction” caveat from the plan held: the differences are real but small.

The operational fragility finding is the one I would not have predicted. Both crashes came from swap-evict races during model rotations, which is the path a real serving setup exercises but a throughput-only post never does. The granite-8b find was a happy accident: added late in v2, it won the speed-at-quality leaderboard. The lesson worth carrying is to include smaller, less-hyped models as baselines; they sometimes lap the field.

  • Decision doc and raw results: ~/Projects/01-homelab/mbp-local-inference/bench-v2/decision.md
  • Eval plan: ~/.claude/plans/i-currently-use-litellm-frolicking-wave.md
  • Related entry: Local LLM Inference Backend Eval: MLX vs llama.cpp on M1 Pro (round 1, surfaced the silent audio bug)

Local LLM Inference Backend Eval: MLX vs llama.cpp on M1 Pro

May 2026, Complete
llama.cpp, MLX, oMLX, LiteLLM, Tailscale, Python, jiwer

TL;DR

  • What: Side-by-side eval of llama.cpp and MLX (via oMLX) as the on-prem inference backend for my homelab: coding daily-driver, Obsidian vault chat, long-context planner, multimodal experiments. Four-metric weighted scoring (quality 0.40, decode tok/s 0.25, TTFT 0.20, peak RSS 0.15) with blind quality grading.
  • Outcome: Zero models met the migration threshold. The verdict for all six evaluable classes plus DeepSeek-R1-32B is to stay on llama.cpp. The most extreme gap was Mistral-Nemo TTFT at 89 ms vs 1,687 ms (~19×).
  • Why interesting: The eval started as a tok/s benchmark and ended by surfacing a silent audio bug in oMLX: every audio clip returns the same hallucinated English sentence with HTTP 200 and no diagnostic. A migration based on throughput posts alone would have silently lost capability.
  • Lesson: Weight quality higher than throughput when the decision is “switch backends,” and blind-score every quality judgment. Without the 40% quality weight, two models would have flagged “migrate” on speed and lost real capability silently.
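The four-metric weighting can be sketched as below. The weights come from the summary above; the normalization is an assumption, since the writeup does not describe one. Here each metric is scaled within the pair so the better backend gets 1.0, with TTFT and peak RSS treated as lower-is-better. The dictionary keys are illustrative names.

```python
WEIGHTS = {"quality": 0.40, "decode_tps": 0.25, "ttft_ms": 0.20, "peak_rss_mb": 0.15}
LOWER_IS_BETTER = {"ttft_ms", "peak_rss_mb"}


def normalize_pair(a: float, b: float, lower_is_better: bool):
    """Scale two raw values so the better one maps to 1.0 (assumed scheme)."""
    if lower_is_better:
        if a <= 0 or b <= 0:
            raise ValueError("lower-is-better metrics must be positive")
        best = min(a, b)
        return best / a, best / b
    best = max(a, b)
    if best <= 0:
        return 0.0, 0.0
    return a / best, b / best


def weighted_scores(current: dict, candidate: dict):
    """Return (current_score, candidate_score) on a 0 to 1 scale."""
    score_cur = score_cand = 0.0
    for metric, weight in WEIGHTS.items():
        n_cur, n_cand = normalize_pair(
            current[metric], candidate[metric], metric in LOWER_IS_BETTER
        )
        score_cur += weight * n_cur
        score_cand += weight * n_cand
    return score_cur, score_cand
```

With everything else equal, even a 19× TTFT gap can move the candidate's score by at most the 0.20 TTFT weight, which is the point of capping speed metrics below quality.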

What I built

The question was simple: should I move my self-hosted LLM stack (coding daily-driver, Obsidian vault chat, long-context planner, and Gemma 4 multimodal experiments) from llama.cpp to Apple’s MLX framework? MLX benchmark posts had been claiming significant wins on Apple Silicon, and the current llama.cpp setup behind a LiteLLM proxy works fine but isn’t the point of comparison those benchmarks use. I wanted to know whether the gains were real for the workloads I actually run.

The eval rig ran side-by-side with the production stack, never interrupting cluster traffic. Both backends sat behind the same LiteLLM proxy, so the harness measured the end-to-end path the cluster uses, not raw server throughput. Per-model llama-server instances were started and stopped on demand by a small shell manager because 32 GB of RAM wouldn’t hold them all resident; oMLX served all MLX models from a single process with LRU eviction. A small Python harness wrote one CSV row per streamed request, capturing TTFT, decode tok/s, peak RSS, wall time, and both backend versions.

Three methodology calls mattered. First, four weighted metrics with quality at 0.40: a faster wrong answer is worse than a slower right one, and the weight explicitly prevents “MLX wins on tok/s” from auto-meaning “migrate.” Second, blind quality scoring: the scoring CLI hides which backend produced each completion until after the rubric score is submitted, preventing me from unconsciously favoring whichever backend I wanted to win. Third, matched-pair sweeps with per-cohort orchestration so the RAM ceiling never got violated; one non-production server resident at a time. Multimodal evaluation went through a separate runner using OpenAI-format input_audio and image_url content types, scored with jiwer-WER against ground-truth transcripts for audio and the same blind rubric for text and image.
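Blind scoring comes down to separating what the grader sees from the key that maps items back to backends. This is a minimal sketch of that idea with hypothetical field names, not the scoring CLI itself.

```python
import random


def blind_queue(completions, seed=None):
    """Shuffle completions and strip backend labels.

    completions: list of dicts with 'backend', 'prompt_id', and 'text'.
    Returns (items, key): items are safe to show the grader, key stays hidden.
    """
    rng = random.Random(seed)
    order = list(range(len(completions)))
    rng.shuffle(order)
    items, key = [], {}
    for blind_id, idx in enumerate(order):
        c = completions[idx]
        items.append({"blind_id": blind_id, "prompt_id": c["prompt_id"], "text": c["text"]})
        key[blind_id] = c["backend"]
    return items, key


def unblind(scores, key):
    """Group submitted rubric scores by backend once grading is finished."""
    by_backend = {}
    for blind_id, score in scores.items():
        by_backend.setdefault(key[blind_id], []).append(score)
    return by_backend
```

Keeping the key out of the grader's view until every score is submitted is what makes the rubric numbers usable as evidence rather than preference.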

What worked, what didn’t

  • Worked: Six matched model pairs, 780 measurement rows, production stack untouched throughout. The verdict for all six classes plus DeepSeek-R1-32B is to stay on llama.cpp. Qwen3-14B (coding) warm TTFT 318 ms vs 3,180 ms (10×). Qwen3-8B tied at gap 0.000. Mistral-Nemo TTFT 89 ms vs 1,687 ms (19×, the most extreme in the eval).
  • The audio finding: oMLX 0.3.12, despite bundling mlx-audio, returns the same canned hallucination for every audio clip: a fixed English sentence regardless of input. Diagnosis is a mel-filter configuration mismatch in oMLX’s preprocessing producing degenerate features, so the model falls back to a generic completion. Failure is silent: HTTP 200, plausible text, no diagnostic to the client. llama.cpp transcribes the same clips at WER ~0.15 (~85% accuracy). A reproduction + diagnostic was drafted as a GitHub issue against the oMLX repo.
  • Honest concessions: RSS systematically under-reports MLX because Apple’s unified-memory model parks Metal buffers under wired_count, not the process RSS field psutil samples, so the 0.15 RSS weight effectively favored MLX by an unknown margin. A proper workaround would sample vm_stat wired-page deltas; logged as follow-up. Sample size is also small (1 to 7 quality scores per text combo, 15 reps per audio combo), so the results are directional rather than statistically significant.
  • Excluded by design: Single-stream measurement only; oMLX claims good multi-request behavior, but verifying that is a separate eval. Quant parity is also a polite fiction: Q4_K_M (llama.cpp) and MLX -q4 (gs=64) have different bit-budgets and error distributions. I judged each backend at its “recommended 4-bit” rather than normalizing the difference away.

What I learned

The framework that made the eval credible was the 40% quality weight plus blind scoring. Without either piece, the verdicts would have drifted. With both, the speed gaps that looked decisive on paper (10×, 19× TTFT) still didn’t justify a migration because the quality side didn’t separate the backends enough to matter. The plan agent predicted “partial migration, multimodal only” before any data came in; the data inverted that. I’d reach for the same framework for the next backend decision and trust the data over the prior.

The audio bug is the meta-lesson about benchmark-only evaluation. Throughput posts wouldn’t have caught it. The MLX path returns plausible text fast, and only running real audio through the production code path with WER scoring against ground truth made the failure visible. The scope of the bench mattered more than the metrics in the bench.
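Word error rate is simple enough to sketch without dependencies. The eval itself used jiwer; the function below is a stand-in that computes word-level edit distance divided by reference length, and its lowercasing and whitespace tokenization are simplifications of my own rather than jiwer's behavior.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    canned = "thank you for watching"  # made-up stand-in for a fixed output
    for ref in ("turn on the kitchen lights", "what time is it"):
        print(ref, "->", word_error_rate(ref, canned))
```

A backend that returns the same sentence for every clip scores a high WER against almost any ground truth, which is how a failure that looks fine at the HTTP layer becomes visible.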

One scope caveat worth being explicit about: this eval was on an M1 Pro with 32 GB. MLX’s gains scale with memory bandwidth, and a higher-bandwidth Apple Silicon part (M3 Max, M4 Max, M-Ultra) could plausibly flip multiple verdicts. The conclusion is “stay on llama.cpp for this hardware,” not “MLX is worse.”

  • Code: harness, prompt sets, multimodal corpus, scoring CLIs, decision matrix calculator (all reusable for future backend evals)
  • oMLX audio bug: reproduction + diagnostic drafted as a GitHub issue against the oMLX repo
  • Related entry: LLM Backend Eval Round 2: K8s Triage Scenarios (follow-up that validated the verdict on a production-shaped workload)