Homelab

~/homelab > whoami

I run a small production-grade homelab: a three-node Talos Kubernetes cluster on bare-metal Lenovo M700s, fronted by a Debian server that handles NAT, Tailscale subnet routing, and a few services that do not belong in Kubernetes. That server also runs the NVR stack (go2rtc, Frigate, and a Coral USB TPU for on-device object detection), fed by PoE cameras on a separate wired network isolated from the house LAN.

The cluster hosts Git, task tracking, a self-hosted social scheduler, a workflow engine, an object store, and a personal agent stack. Small enough that I understand every layer. Large enough that real failures happen, which is the point.

What I run, and why →

~/homelab > topology

flowchart TB Internet([Internet]):::external CF[Cloudflare DNS
niard.cloud]:::external Tailnet([Tailscale Tailnet]):::external Cameras[("PoE Cameras")]:::external subgraph DebianHost["Debian Server (Lenovo M700)"] direction TB DebianEdge["NAT • Subnet Router"]:::edge subgraph NVR["NVR Stack"] direction LR Go2rtc["go2rtc
RTSP proxy"]:::app Frigate["Frigate"]:::app Coral[("Coral USB TPU
object detection")]:::platform end Go2rtc --> Frigate Frigate -.-> Coral end subgraph Cluster["Talos Cluster (3x Lenovo M700)"] direction TB GW["Cilium Gateway API
L2 LoadBalancer Pool"]:::platform subgraph Platform["Platform Layer"] direction LR Longhorn[(Longhorn
Storage)]:::platform Cert[cert-manager]:::platform ExtDNS[ExternalDNS]:::platform Prom[Prometheus
+ Grafana]:::platform end subgraph Public["Public Services"] direction LR Forgejo[Forgejo]:::app Vikunja[Vikunja]:::app Postiz[Postiz Stack]:::app Windmill[Windmill]:::app MinIO[MinIO]:::app end subgraph VClusters["vclusters"] direction LR subgraph Agent["Agent Stack"] Hermes[Hermes Agent]:::agent Synapse[Synapse Matrix]:::agent end subgraph Bench["Bench Sandbox"] Evals[Eval Rigs]:::sandbox Spare[Spare Rigs]:::sandbox end end GW --> Public GW --> Agent Public -.-> Longhorn Agent -.-> Longhorn Bench -.-> Longhorn end Internet --> CF CF --> GW Internet -.->|Wireguard| Tailnet Tailnet --> DebianEdge DebianEdge --> Cluster Cameras --> Go2rtc classDef external stroke:#64748b classDef edge stroke:#fbbf24 classDef platform stroke:#4ade80 classDef app stroke:#60a5fa classDef agent stroke:#a78bfa classDef sandbox stroke:#94a3b8 classDef vclusterGroup stroke:#a78bfa,stroke-dasharray:4 4 class VClusters vclusterGroup

~/homelab > vclusters

The two vclusters in the topology above are what happens to be running today, not a fixed part of the stack. vcluster lets me spin a fresh control plane on the host cluster in a few minutes whenever I need one: a benchmark rig, an upgrade rehearsal, a sketchy experiment that deserves its own blast radius. Each one gets its own API server and CRDs without sharing state with the rest of the cluster. When the work is done I delete the vcluster and the host is left clean.

~/homelab > inference

I don’t have gobs of money to throw at frontier labs, and I also don’t trust any of them. I try to keep anything sensitive or novel that I don’t want to share with the world locally. Inference runs on my MacBook Pro M1 with 32GB of unified memory, fronted by llama.cpp and MLX. Local sweet spots: gemma3-12b for general reasoning, qwen3-14b for code-shaped tasks, parakeet-mlx for ASR.

LiteLLM sits in front of every backend as a single OpenAI-compatible gateway. Agents in the cluster reach it over Tailscale and ask for a model by name; LiteLLM decides whether the request stays on the M1 or fans out to a cloud provider. That decoupling means I can swap a backend, retune routing, or pin a per-model budget without touching any agent code. When a request needs a frontier model the M1 cannot hold (large context, kimi-k2.5-class capability, heavy tool-use loops), LiteLLM falls back to NVIDIA NIM or OpenCode Go. Both are gated by a dollar budget at the router, so a runaway loop cannot drain them.

flowchart LR Agents["Homelab Agents
Talos + vclusters"]:::agent subgraph M1["MacBook Pro M1 / 32GB"] direction TB LiteLLM["LiteLLM Router"]:::edge Backends["llama.cpp · MLX"]:::platform Models["gemma3-12b · qwen3-14b · parakeet-mlx"]:::app LiteLLM --> Backends Backends --> Models end subgraph Cloud["Cloud Fallback"] direction TB NIM["NVIDIA NIM"]:::external OC["OpenCode Go"]:::external end Agents -.->|Tailscale| LiteLLM LiteLLM -->|frontier models| Cloud classDef external stroke:#64748b classDef edge stroke:#fbbf24 classDef platform stroke:#4ade80 classDef app stroke:#60a5fa classDef agent stroke:#a78bfa

~/homelab > status

Forgejo 100.00%

Git server

Vikunja 100.00%

Task tracker

Postiz 100.00%

Social scheduler

Windmill 100.00%

Workflow engine

MinIO 100.00%

Object store

Grafana 100.00%

Metrics frontend

Talos API 100.00%

Cluster control plane

Debian Gateway 100.00%

NAT + subnet router

Probed Jun 22 16:10 UTC by an in-cluster CronJob. This is a static build, so the grid is point-in-time: it reflects whatever snapshot was current the last time the site was rebuilt.

Uptime is averaged over a 30-day rolling window. Probes started Jun 22, 2026, so current figures reflect only the data collected since then.

~/homelab > metrics

Cluster CPU (24h)

25.5% min 25.1 · max 25.8

Node memory used

talos-cp-01 14%

talos-worker-02 46%

talos-worker-01 45%

Pods running

Snapshot Jun 22 16:10 UTC from in-cluster Prometheus. Static build, so this is point-in-time: it reflects whatever snapshot was current the last time the site was rebuilt.

~/homelab > field-notes

changelog

What I Run, and Why

The reasoning behind the choices in my homelab: why Talos, why Debian as the gateway, why local inference, and where convenience and control trade off.

read

postmortem earlier this spring

A Power Outage Cut My Access To My Homelab During Vacation

A power outage during a week-long Charleston vacation took down the Debian server that fronts my Talos cluster. With LUKS at boot and no UPS to ride out the flicker, my remote access to the homelab stayed cut for the whole trip. I came home, ordered an APC Back-UPS 600, and then put off plugging it in. A second outage a few days later was what finally got it installed.

Remote access to homelab cut for the full 7-day trip while services kept running locally read

postmortem 2026-06-19 to 2026-06-20

Debian LUKS Locked Me Out of My Cluster for 14 Hours

An overnight reboot on the LUKS-encrypted Debian server that fronts my Talos cluster halted at the passphrase prompt. The cluster and every hosted service were unreachable until I drove to the box and typed it in.

Talos cluster and all hosted services unreachable for ~14 hours read