Zero-downtime deploys (infra)
How the infra/ Ansible deploy avoids dropping requests during a restart.
Status update (socket activation now implemented). Parts of the prose below
were written before the Phase-1 merge and assumed a single /healthz. Two things
changed:
- The app exposes /livez (liveness: 200 while the process is alive, incl.
  warmup) and /readyz (readiness: 503 during warmup/drain) on the internal
  port, plus the legacy /healthz.
- Socket activation is implemented (systemdListener in cmd/server/main.go
  inherits the LISTEN_FDS fd; infra/socket.j2 defines the per-service
  .socket). The listening socket now outlives a restart, so connections queue
  in the backlog instead of being refused. This gives true zero-downtime on
  one server, with no LB and no SQLite contention. The Traefik active check
  targets /livez (not /readyz): with a single backend the socket queues the
  warmup, so the check must keep the backend in rotation and only catch a
  genuinely dead process.

Canonical references:
docs/dev/zerodowntime_research.md (E1–E7 experiments) and
docs/en/user/zerodowntime.md (user recipes). The sections below describe the
earlier health-check + drain wiring, kept as the fallback path.
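The /livez and /readyz split described in the status update can be illustrated with a minimal sketch. It is not the app's code: probeState, its fields, and newProbeMux are hypothetical names chosen for the example, and only the status codes follow the description above.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// probeState is a hypothetical holder for the two flags the probes read.
type probeState struct {
	warm     atomic.Bool // set once warmup has finished
	draining atomic.Bool // set once SIGTERM has been received
}

// liveCode is 200 for as long as the process can answer at all,
// including during warmup, so the proxy keeps the backend in rotation.
func (s *probeState) liveCode() int {
	return http.StatusOK
}

// readyCode is 503 during warmup and during drain, 200 otherwise.
func (s *probeState) readyCode() int {
	if !s.warm.Load() || s.draining.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// newProbeMux wires both probes onto a mux meant for the internal port.
func newProbeMux(s *probeState) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(s.liveCode())
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(s.readyCode())
	})
	return mux
}
```

With a single backend behind a queuing socket, pointing the active check at the liveness path means a warming process is never ejected; only a process that cannot answer at all fails the check.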
Current deploy flow
Single server. Ansible (infra/site.yml) copies one shared binary to
/opt/trip2g/bin/app, then systemctl restarts each per-site systemd unit
(trip2g, trip2g_landing, trip2g_demo2, trip2g_founder,
trip2g_internaldev, ...). Each site is independent: its own domain, its own
SQLite DB, its own /etc/{service}.env. Traefik (v3, file/dynamic provider,
infra/traefik_dynamic.yml.j2) routes each domain to a single backend.
App contract (verified, this branch)
The app already exposes the signals the infra needs — no /livez + /readyz
split, the single /healthz already encodes readiness:
| Signal | Where | Behaviour |
|---|---|---|
| /healthz | internal port (--internal-listen-addr) | 200 while serving; 503 the instant SIGTERM arrives (a.stopped), and during warmup |
| drain | on SIGTERM | flip /healthz → 503 → sleep(--shutdown-grace-period) → graceful HTTP Shutdown (bounded by --shutdown-timeout) |
| public traffic | --listen-addr | separate port from the internal/health one |
Flags (env equivalents use the TRIP2G_ prefix, e.g. TRIP2G_INTERNAL_LISTEN_ADDR;
CLI args on ExecStart take precedence over env):
- --listen-addr: public port
- --internal-listen-addr: health/metrics port (default :8082)
- --shutdown-grace-period: drain window (app default 50ms; we set 5s)
- --shutdown-timeout: graceful-shutdown bound (app default 1s; we set 10s)
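The precedence rule for the flags above (CLI argument, then environment, then app default) can be sketched as follows. The upper-case/underscore naming rule in envName is an assumption inferred from the single TRIP2G_INTERNAL_LISTEN_ADDR example, and both function names are hypothetical; the app's real flag parsing may differ.

```go
package main

import "strings"

// envName derives the environment variable for a flag, e.g.
// "--internal-listen-addr" -> "TRIP2G_INTERNAL_LISTEN_ADDR".
// The naming rule is assumed from that one documented example.
func envName(flag string) string {
	name := strings.TrimLeft(flag, "-")
	name = strings.ReplaceAll(name, "-", "_")
	return "TRIP2G_" + strings.ToUpper(name)
}

// resolve returns the effective value: a CLI argument wins over the
// environment, and the built-in default applies only when neither is set.
func resolve(flag string, cli, env map[string]string, def string) string {
	if v, ok := cli[flag]; ok {
		return v
	}
	if v, ok := env[envName(flag)]; ok {
		return v
	}
	return def
}
```

For example, with --shutdown-grace-period 5s on ExecStart and a different value in /etc/{service}.env, resolve returns the ExecStart value.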
There is no --simple-backup-on-shutdown or --vacuum-cron flag. Shutdown
backup runs only if simple-backup is configured for that site; on the rolling
deploy path the restarted unit takes over, so no extra backup flag is needed.
What is wired today (gets us to "no in-flight loss", ~99%)
- Per-site internal health port (infra/site.yml, service.j2). Each
  service gets a unique internal_addr (localhost:1908x). Without this every
  unit would collide on the app default :8082 and health checks would be
  meaningless.
- /healthz active health check on the internal port for every trip2g
  loadBalancer (traefik_dynamic.yml.j2): path: /healthz, interval: 1s,
  timeout: 1s. Detection window ≈ 1–2s.
- Drain ≥ detection window. --shutdown-grace-period 5s > the ~2s Traefik
  detection window, so on SIGTERM Traefik pulls the backend out of rotation
  before the socket closes.
- TimeoutStopSec=20 (> grace + shutdown-timeout) plus KillSignal=SIGTERM /
  KillMode=mixed guarantee systemd never SIGKILLs mid-drain. This is the fix
  for the old "forever to restart" hang: the old binary no longer blocks the
  critical path, and systemd waits out the bounded drain instead of hanging.
- retry-idempotent middleware (retry: { attempts: 4 }) on every trip2g
  router.
Honest limit: retry is a near no-op with one backend
Traefik's retry retries against the server pool. With a single backend per
site, every retry during a restart resolves to the same just-killed backend, so
the unbound-port window between old-process-exit and new-process-listen still
refuses connections. The health check + drain shrink in-flight request loss to
near zero, but the restart gap itself is not closed by anything wired today.
retry becomes genuinely useful only once a second backend exists (see below);
it is kept now because it is harmless and covers transient upstream blips.
The step to true 100%
Two ways to cover the unbound-port restart gap. For this topology
(per-site embedded SQLite, one shared binary, file-provider Traefik),
socket activation is recommended and blue-green is discouraged.
Recommended: systemd socket activation (needs a small app change)
A trip2g@.socket owns the public port and outlives the service restart, so
the kernel queues incoming connections (extra ms of latency) instead of refusing
them; the new process inherits the already-bound fd via LISTEN_FDS.
- Exactly one process per site DB at all times → no SQLite contention.
- Unit count unchanged; Traefik config unchanged.
- Follow-up app code change required: the server currently calls
  ListenAndServe(--listen-addr). It must instead use the passed listener fd
  when LISTEN_FDS/LISTEN_PID are set (Go: net.FileListener on fd 3),
  falling back to --listen-addr otherwise. Until that lands, socket activation
  cannot be enabled.
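The listener change described in the last bullet might look roughly like the sketch below. It is not the systemdListener function from cmd/server/main.go; the names activated and listen are hypothetical, and the sketch handles only the first passed descriptor (fd 3) over TCP.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// firstSystemdFD is the first descriptor systemd passes to a service.
const firstSystemdFD = 3

// activated reports whether the environment describes at least one
// descriptor handed to this exact process.
func activated(listenPID, listenFDs string, pid int) bool {
	p, err := strconv.Atoi(listenPID)
	if err != nil || p != pid {
		return false
	}
	n, err := strconv.Atoi(listenFDs)
	return err == nil && n >= 1
}

// listen returns the inherited listener when socket-activated and falls
// back to binding addr (the --listen-addr value) otherwise.
func listen(addr string) (net.Listener, error) {
	if !activated(os.Getenv("LISTEN_PID"), os.Getenv("LISTEN_FDS"), os.Getpid()) {
		return net.Listen("tcp", addr)
	}
	f := os.NewFile(firstSystemdFD, "systemd-socket")
	defer f.Close() // FileListener works on a copy, so f can be closed
	ln, err := net.FileListener(f)
	if err != nil {
		return nil, fmt.Errorf("inherit fd %d: %w", firstSystemdFD, err)
	}
	return ln, nil
}
```

The server would then call Serve on the returned listener instead of ListenAndServe. Checking LISTEN_PID against the current pid guards against a child process mistaking inherited environment variables for its own activation.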
Discouraged here: blue-green
Two units per site (trip2g@blue/@green) on two ports, both as Traefik
backends with the /healthz check; release starts the idle colour, waits for
/healthz=200, then stops the old. This is where the retry middleware finally
earns its place. But it is the wrong tool for embedded per-site SQLite: blue
and green run simultaneously during the cutover, and the DB is opened with
busy_timeout=20000 (single writer). A write on the draining old colour during
handoff makes the other side block up to 20s / hit SQLITE_BUSY. It also doubles
the unit count and forces the release task to track the current colour. If ever
adopted, scope it to the flagship site only.
Recommended release sequence
- ansible-playbook -i hosts site.yml -e ... builds/uploads the binary and
  re-renders traefik_dynamic.yml (health check + retry) and the units.
- The Restart service handler systemctl restarts each unit. Per unit:
  SIGTERM → /healthz 503 → Traefik drops it within ~2s → app drains 5s →
  graceful shutdown → new process starts and warms → /healthz 200 → Traefik
  routes again.
- Deploy Traefik config (/etc/traefik/dynamic.yml); the file provider
  hot-reloads it (no Traefik restart needed for dynamic changes).
To verify a single site live (do not run against prod casually):
curl -s localhost:1908x/healthz should be ok; during a restart it returns
shutting down (503) for the grace window.
Risks / follow-ups
- Restart gap is still open until socket activation lands (the one app-code
  follow-up). Today's wiring is "minimal+drain", honestly ~99% with idempotent
  retry tails, not a hard 100%.
- internal_addr ports (1908x) are bound on localhost only and must stay
  unique per service and free of UFW exposure (they are not opened; only
  80/443 are).