Debian LUKS Locked Me Out of My Cluster for 14 Hours

Impact: Talos cluster and all hosted services unreachable for ~14 hours

TL;DR

  • What happened: The Debian server that NATs and Tailscale-routes my cluster’s isolated LAN rebooted overnight and stopped at the LUKS passphrase prompt.
  • Impact: Talos cluster and every hosted service unreachable for roughly 14 hours, until I was physically at the keyboard.
  • Root cause: Full-disk encryption requires a human at boot. The Debian server is the cluster’s parent dependency, so its unavailability cascaded to everything downstream.
  • Lesson: Encrypted single points of failure trade availability for confidentiality on purpose. The right response is to size the UPS, write the runbook, and decide deliberately how much theft protection is worth.

Impact

The Debian box does three jobs I cannot easily replace: it NATs the isolated 192.168.2.0/24 LAN the Talos nodes live on, advertises that LAN as a Tailscale subnet route so I can reach the cluster from anywhere, and runs the Tailscale Operator proxies that bridge in-cluster services back out to my tailnet. When it sat at the LUKS prompt overnight, the cluster lost its only path to the internet, every public hostname under my domain stopped serving, and every tailnet-published service went dark. Detection lag was the entire workday I was not at home, with no monitoring path to reach me because the monitoring lives behind the same dead gateway.

Timeline

  • ~22:00 2026-06-19. Reboot trigger. Exact cause unconfirmed (educated guess: brief power event). Server hit the LUKS prompt and waited.
  • All night. Cluster, public services, and tailnet bridges offline. No alerts fired; the alert path runs through the same Debian box.
  • 11:49 2026-06-20. Tried kubectl get nodes from my laptop. Five timeouts in a row pointed at the control plane IP being unreachable.
  • 11:51 2026-06-20. tailscale status showed Debian and seven downstream services all marked “offline, last seen 14h ago.” Coordinated outage signature, not a per-service issue.
  • ~12:08 2026-06-20. At the box. Plugged in a keyboard, typed the passphrase.
  • 12:10 2026-06-20. Debian back on the tailnet. Four-minute uptime confirmed the LUKS theory.
  • 12:14 2026-06-20. Cluster API reachable, all three nodes Ready, recovery confirmed.
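
The 11:51 reading, many nodes sharing nearly the same "last seen" time, can be turned into a quick check. What follows is a minimal sketch, not a parser for real tailscale status output: the function name, the five-minute window, the minimum group size, and the sample hostnames and timestamps are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def coordinated_groups(last_seen, window=timedelta(minutes=5), min_size=3):
    """Group nodes whose last-seen timestamps fall close together.

    After sorting by time, a node joins the current group if it was last
    seen within `window` of the previous node; otherwise a new group starts.
    Groups smaller than `min_size` are dropped as unrelated stragglers.
    """
    ordered = sorted(last_seen.items(), key=lambda item: item[1])
    groups, current = [], []
    for name, seen_at in ordered:
        if current and seen_at - current[-1][1] > window:
            groups.append(current)
            current = []
        current.append((name, seen_at))
    if current:
        groups.append(current)
    return [sorted(name for name, _ in group)
            for group in groups if len(group) >= min_size]


# Made-up sample data (hypothetical hostnames and times), not real output.
SAMPLE = {
    "debian": datetime(2026, 6, 19, 22, 0),
    "svc-a": datetime(2026, 6, 19, 22, 1),
    "svc-b": datetime(2026, 6, 19, 22, 1),
    "svc-c": datetime(2026, 6, 19, 22, 2),
    "old-laptop": datetime(2026, 6, 12, 9, 30),
}

if __name__ == "__main__":
    for group in coordinated_groups(SAMPLE):
        print("went dark together:", ", ".join(group))
```

A single large group is a hint to look for a shared dependency before debugging any one service. A node that dropped off days earlier, like the hypothetical old-laptop entry, falls outside the group and is ignored.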

Root cause

LUKS2 full-disk encryption requires manual passphrase entry at every boot, by design. systemd-cryptsetup halts the sequence until a passphrase arrives, and the Debian server has no TPM-bound key, no network-fetched key, and no out-of-band KVM. The one way past the prompt is physical presence.

The deeper issue is that I had treated the Debian server as a peer of the cluster rather than a parent. In practice it is upstream of everything: the cluster’s egress, the tailnet’s reachability into the cluster, and the alerting path that would have told me any of this. A single LUKS prompt took Debian and the seven services behind it offline at once, because every one of those services transitively depended on this one machine’s network being live. What actually triggered the reboot is still unconfirmed; I have not yet read last -x and journalctl -b -1 on the box.

Fixes

  • Immediate: Physical passphrase entry. Everything downstream came back automatically once Debian’s network was up; no per-service intervention.
  • Structural: A UPS was already in my office at the time of this outage, still in its box. The earlier Charleston outage was what made me order it; I just had not yet plugged it in. This outage was what finally got it installed. A UPS alone is necessary but not sufficient: it does nothing for kernel panics, hardware faults, or outages longer than the battery runtime.
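
On the "outages longer than the battery runtime" point, a back-of-the-envelope estimate is usable battery energy divided by load. The sketch below assumes a flat inverter efficiency of 0.8 and uses made-up capacity and load figures. None of these numbers come from my actual UPS, and real runtime curves are not linear, so treat the result as a rough guide only.

```python
def estimated_runtime_minutes(battery_wh, load_w, efficiency=0.8):
    """Rough UPS runtime in minutes: usable energy (Wh) divided by load (W).

    `efficiency` is an assumed flat inverter efficiency; real units vary
    with load and battery age.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * efficiency / load_w * 60


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    minutes = estimated_runtime_minutes(battery_wh=100, load_w=40)
    print(f"estimated runtime: {minutes:.0f} minutes")
```

Comparing that estimate against the longest outage worth riding through is one way to decide whether the UPS is sized to carry the box or only to cover brief power events.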

Open questions

  • Clevis + Tang for network-bound unlock. Stand up a Tang server on a separate LAN device; Clevis on Debian fetches the unlock key at boot. Works as long as the network is up. The Tang host then becomes a new single point of failure with its own unlock story.
  • TPM-bound LUKS via systemd-cryptenroll. Bind the LUKS key to the TPM2 chip with a PCR policy. Auto-unlocks on this hardware as long as the boot chain has not been tampered with. This gives up most of the physical-theft protection, because a stolen box unlocks its own disk at boot just as it does at home. Acceptable if the box never leaves the house.
  • Accept manual unlock as a deliberate availability cost. Be physically present for reboots; treat it as runbook rather than automation. Defensible if I value the data-at-rest protection more than a few hours of yearly downtime.
  • Out-of-band management. A Pi KVM or IPMI on different hardware would let me type the passphrase from anywhere without changing the security model. Bigger hardware investment than the other options.
  • What actually rebooted the box overnight. Not yet investigated. The right structural fix depends partly on whether this was a power event, a kernel update, an unattended-upgrades reboot, or a hardware fault.

Lessons

Identify parent dependencies before they bite. I knew Debian was the NAT gateway; I had not internalized that the entire monitoring and alerting path also ran through it, which is what made the outage invisible until I happened to look. A “what depends on what, transitively” exercise would have surfaced this.
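
The "what depends on what, transitively" exercise can be as small as a hand-written map and a graph walk. This is a minimal sketch under stated assumptions: the node names and edges are a simplified, hypothetical rendering of the setup described in this post, not an inventory of the real services.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed items.
# Names and edges are illustrative, loosely following this writeup.
DEPENDS_ON = {
    "debian-gateway": [],
    "talos-cluster": ["debian-gateway"],
    "tailnet-proxies": ["debian-gateway", "talos-cluster"],
    "public-hostnames": ["talos-cluster"],
    "alerting": ["talos-cluster"],
}


def transitive_dependents(depends_on, root):
    """Return every item that directly or indirectly depends on `root`."""
    # Invert the map so each item lists who depends on it directly.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the root.
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    print(transitive_dependents(DEPENDS_ON, "debian-gateway"))
```

In this toy map, alerting shows up among the gateway's dependents even though it has no direct edge to it, which is the kind of indirect link that is easy to miss when only thinking about direct dependencies.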

Coordinated outage signatures are diagnostic. When eight things go dark at the same instant, the cause is almost always a shared dependency. The tailscale status output with “last seen 14h ago” clustered across every downstream service was the smoking gun, and it took under a minute to read once I knew what to look for. Worth keeping in the first-look toolkit.

Security and availability sit in tension, and the answer is rarely to pick one. The next version of this writeup needs to name the threat model explicitly so the tradeoff is visible.

postmortem availability encryption single-point-of-failure