Process isolation for the fleet code-executor
TL;DR / conclusion. The fleet code-executor already does the cheap, portable basics right (scrubbed env, temp cwd, context timeout, stdout cap). The next gains are tiered:
- Now, cheap + (mostly) portable: add POSIX rlimits (RLIMIT_CPU/FSIZE/NOFILE/NPROC, AS as a coarse guard) and network-off-by-default via a Linux net namespace, plus run the child under a dedicated unprivileged uid. Go-native, no daemon, no root (with user namespaces).
- Self-hosted-Linux hardening: wrap the allowlisted program in nsjail (or bubblewrap), one binary that bundles namespaces + cgroups v2 + rlimits + seccomp. Equivalent hand-rolled stack: SysProcAttr namespaces + containerd/cgroups + elastic/go-seccomp-bpf + go-landlock, applied in a thin re-exec child launcher (never in the daemon).
- Untrusted role-authors: gVisor (runsc) for the best strength/effort ratio (user-space kernel, no VM, OCI drop-in), or a microVM (Firecracker / Kata) for true kernel-level isolation. If role code can be compiled to Wasm, wazero is the strongest and cheapest option (in-process capability sandbox, pure Go, no root).
Two load-bearing facts for Go: namespaces are first-class in syscall.SysProcAttr (free isolation, no external binary), but seccomp and tight rlimits do NOT compose with the multithreaded Go parent: they must be applied in the child just before exec, not in the long-running daemon.
1. Baseline (already shipped) and threat model
The executor spawns operator-allowlisted programs with exec.CommandContext on the role's rendered code. Already in place:
- Secret-scrubbed minimal env: PATH + FLEET_INPUT + only role-declared env_passthrough / env_prefix.
- Per-run throwaway temp workdir (os.MkdirTemp → Cmd.Dir, deleted after).
- Role timeout (exec.CommandContext cancellation).
- stdout byte cap (limited writer).
- Off by default (--allowed-programs).
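The shipped baseline can be pictured with a minimal sketch like the one below. It is not the executor's actual code: the names cappedWriter, scrubbedEnv and runBaseline, the fixed PATH value, and the limits passed in main are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// cappedWriter keeps at most limit bytes and drops the rest, so a chatty
// child cannot grow the daemon's memory without bound.
type cappedWriter struct {
	buf   bytes.Buffer
	limit int
}

func (w *cappedWriter) Write(p []byte) (int, error) {
	if room := w.limit - w.buf.Len(); room > 0 {
		if len(p) > room {
			p2 := p[:room]
			w.buf.Write(p2)
		} else {
			w.buf.Write(p)
		}
	}
	// Report the full length so the copy loop keeps draining the pipe
	// instead of failing with a short-write error.
	return len(p), nil
}

// scrubbedEnv builds the child's environment from scratch: PATH, FLEET_INPUT,
// and only the variables the role explicitly passes through.
// The PATH value here is a hypothetical default.
func scrubbedEnv(input string, passthrough []string) []string {
	env := []string{"PATH=/usr/bin:/bin", "FLEET_INPUT=" + input}
	for _, name := range passthrough {
		if v, ok := os.LookupEnv(name); ok {
			env = append(env, name+"="+v)
		}
	}
	return env
}

// runBaseline runs program with a scrubbed env, a throwaway workdir,
// a wall-clock timeout, and a stdout cap.
func runBaseline(program string, args []string, input string, timeout time.Duration, capBytes int) (string, error) {
	workdir, err := os.MkdirTemp("", "fleet-run-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(workdir)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out := &cappedWriter{limit: capBytes}
	cmd := exec.CommandContext(ctx, program, args...)
	cmd.Env = scrubbedEnv(input, nil)
	cmd.Dir = workdir
	cmd.Stdout = out
	err = cmd.Run()
	return out.buf.String(), err
}

func main() {
	out, err := runBaseline("/bin/sh", []string{"-c", "echo $FLEET_INPUT; pwd"}, "hello", 2*time.Second, 5)
	fmt.Printf("%q %v\n", out, err)
}
```

Everything in this sketch is cross-platform except the /bin/sh path used in main; none of it confines what the child can read or reach.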
Threat model. Self-hosted, typically Linux. Role notes are operator-trusted, but the code they render can be adversarial or prompt-injected. So the boundary we care about is: a hostile allowlisted-program invocation must not exfiltrate secrets, reach the network, escape the workdir, exhaust the host, or pivot to the host kernel/other tenants.
What the baseline does not yet stop: unbounded CPU/RAM/PIDs/disk inside the timeout window, network access, reading anything the executor's uid can read (the temp cwd is not a security boundary — the child can chdir away or use absolute paths), and dangerous syscalls.
2. Universal / Go-native layer (portable)
What os/exec + the standard library + golang.org/x/sys/unix give you without any external tool or root. This is the cheap layer to maximize first.
Already portable and shipped
- Minimal env (Cmd.Env), context timeout (exec.CommandContext), temp cwd (Cmd.Dir), output caps (wrap the reader/io.LimitReader/limited writer). All cross-platform (Linux/macOS/Windows).
- Working-dir scoping is not a confinement boundary on its own: it only sets the initial cwd. Real fs confinement needs Landlock / mount-ns / chroot (below).
Resource limits — syscall.Setrlimit / unix.Prlimit
POSIX resource limits are the portable-ish way to cap a single process:
| rlimit | Caps | Notes / caveats |
|---|---|---|
| RLIMIT_CPU | CPU seconds | Backstop for the wall-clock timeout (a fork-bomb child can dodge a context deadline). |
| RLIMIT_AS | Virtual address space | Coarse only: limits virtual, not resident, memory. Go and modern allocators reserve huge virtual ranges → false OOM. Use cgroup memory.max for a real RAM cap. |
| RLIMIT_DATA | Heap/data segment | Similar virtual-memory caveat. |
| RLIMIT_FSIZE | Max file size the process can create | Cheap disk-write guard. |
| RLIMIT_NOFILE | Open file descriptors | Cached by the Go runtime and restored for children before exec (see Go #66797/#67184). |
| RLIMIT_NPROC | Processes per real uid | Anti-fork-bomb, but it's a global per-uid count, not per-sandbox → fragile if the uid is shared. Prefer the pids cgroup. |
| RLIMIT_CORE / RLIMIT_STACK | Core dumps / stack | Minor hardening. |
Portability. Setrlimit/Getrlimit/Prlimit exist on Linux, macOS, BSD via x/sys/unix; there is no Windows or plan9 equivalent — gate the code behind GOOS/build tags. Behavior also differs per-OS: macOS historically clamped RLIMIT_NOFILE and ignores some limits; RLIMIT_AS enforcement is weak/absent on Darwin; RLIMIT_CPU soft-limit semantics are Linux-specific. (Go syscall/rlimit.go; golang/go #30401, #5949.)
The Go footgun: limiting the child only. syscall.Setrlimit applies to the calling process, and limits are inherited across fork/exec. SysProcAttr has no rlimit field. So:
- Calling Setrlimit in the daemon limits the daemon too (bad, because it's long-running).
- Cleanest options:
  - prlimit(1) wrapper as argv[0]: prlimit --cpu=10 --as=... --nofile=64 --nproc=16 -- <program> <args>.
  - Re-exec helper pattern (idiomatic Go, what runc does): the daemon re-execs /proc/self/exe with a hidden subcommand; the child sets its own rlimits via unix.Setrlimit, then execs the real program. This is also where you apply seccomp/Landlock/NO_NEW_PRIVS (see §5).
  - unix.Prlimit(childPID, …) from the parent after Start(): racy (window between exec and the call) and needs CAP_SYS_RESOURCE when uids differ.
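A minimal sketch of the re-exec helper pattern, using only the standard library. The subcommand name __fleet-launch, the limit values, and the function names are hypothetical. RLIMIT_NPROC is left out because the standard syscall package does not appear to export it on Linux; golang.org/x/sys/unix does.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// launcherArg is a hypothetical hidden subcommand name.
const launcherArg = "__fleet-launch"

// limit pairs an rlimit resource with the value used as both soft and hard limit.
type limit struct {
	resource int
	value    uint64
}

// childLimits are illustrative values, not tuned recommendations.
var childLimits = []limit{
	{syscall.RLIMIT_CPU, 10},         // CPU seconds
	{syscall.RLIMIT_FSIZE, 64 << 20}, // largest file the child may create, in bytes
	{syscall.RLIMIT_NOFILE, 64},      // open file descriptors
}

// launcherCommand is what the daemon starts instead of the program itself:
// its own binary, the hidden subcommand, then the real argv.
func launcherCommand(program string, args ...string) *exec.Cmd {
	argv := append([]string{launcherArg, program}, args...)
	return exec.Command("/proc/self/exe", argv...)
}

// runLauncher executes inside the re-exec'd child: it lowers its own limits,
// then replaces itself with the target program. It only returns on failure.
func runLauncher(argv []string) error {
	for _, l := range childLimits {
		rl := syscall.Rlimit{Cur: l.value, Max: l.value}
		if err := syscall.Setrlimit(l.resource, &rl); err != nil {
			return fmt.Errorf("setrlimit %d: %w", l.resource, err)
		}
	}
	path, err := exec.LookPath(argv[0])
	if err != nil {
		return err
	}
	return syscall.Exec(path, argv, os.Environ())
}

func main() {
	if len(os.Args) > 2 && os.Args[1] == launcherArg {
		// Child side: the daemon's limits are untouched because only this
		// short-lived process calls Setrlimit.
		fmt.Fprintln(os.Stderr, runLauncher(os.Args[2:]))
		os.Exit(126)
	}
	// Daemon side: print the command that would be started.
	fmt.Println(launcherCommand("/bin/echo", "hello").Args)
}
```

The daemon would set Env, Dir and the output writer on the returned command exactly as it does today; the only change is that argv now goes through the launcher first.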
Network-off, the portable-on-Linux way
There is no cross-platform "no network" switch in os/exec. On Linux it is one flag (§3, net namespace). On macOS/Windows you'd need a higher-level sandbox or firewall. So "network off by default" is effectively a Linux feature for the fleet.
Bottom line for this layer: rlimits (via a prlimit wrapper or re-exec helper) and dropping to an unprivileged uid (SysProcAttr.Credential) are the only genuinely portable hardening steps. Real memory/PID/network/filesystem confinement is Linux-specific.
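Dropping to an unprivileged uid is a one-field change on the command. The sketch below assumes the daemon is privileged enough to switch users (typically root); the helper names and the uid/gid 65534 (often "nobody" on Linux) are placeholders, and a real deploy would use a dedicated account.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// asUser returns a command that will start under the given uid/gid.
// With Groups left empty, the child also loses the daemon's supplementary groups.
func asUser(program string, uid, gid uint32, args ...string) *exec.Cmd {
	cmd := exec.Command(program, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{Uid: uid, Gid: gid},
	}
	return cmd
}

// childUID returns the uid cmd is configured to run as,
// or -1 if it would inherit the daemon's uid.
func childUID(cmd *exec.Cmd) int {
	if cmd.SysProcAttr == nil || cmd.SysProcAttr.Credential == nil {
		return -1
	}
	return int(cmd.SysProcAttr.Credential.Uid)
}

func main() {
	cmd := asUser("/usr/bin/id", 65534, 65534)
	// Starting cmd fails with a permission error when the daemon is not
	// allowed to change uid, so this sketch only prints the configuration.
	fmt.Println("child uid:", childUID(cmd))
}
```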
3. Linux-specific primitives — what each one actually restricts
| Primitive | Restricts | Go integration | Root needed? |
|---|---|---|---|
| Namespaces (CLONE_NEW*) | Process/host visibility, mounts, network, IPC, hostname, uids | First-class: syscall.SysProcAttr.{Cloneflags,Unshareflags,UidMappings,GidMappings,Credential,Chroot} | No, if user namespaces are enabled |
| cgroups v2 | CPU, real memory, PIDs, IO quotas | containerd/cgroups/v2, or write cpu.max/memory.max/pids.max + add PID to cgroup.procs | Delegated cgroup (systemd) or root |
| seccomp-bpf | Syscall allow/deny list | elastic/go-seccomp-bpf (pure Go, no cgo) or seccomp/libseccomp-golang (cgo) | No (needs NO_NEW_PRIVS first) |
| Landlock LSM | Filesystem access; TCP bind/connect (ABI v4+) | landlock-lsm/go-landlock with BestEffort() | No (unprivileged by design) |
| no_new_privs | Blocks privilege gain via setuid/setcap exec | unix.Prctl(PR_SET_NO_NEW_PRIVS,1,…) in the child | No |
| Capability dropping | Ambient kernel capabilities | Best: run as unprivileged uid (Credential); SysProcAttr.AmbientCaps adds, libcap drops | No |
| pivot_root / chroot | Filesystem view | SysProcAttr.Chroot; pivot_root via mount ns | chroot needs root; pivot_root works inside a user+mount ns |
Namespaces (the cheap structural isolation)
- net (CLONE_NEWNET): a fresh net namespace with no veth = no connectivity at all (loopback only). The simplest, strongest "network off", and it's just a flag in SysProcAttr.Cloneflags.
- pid (CLONE_NEWPID): child can't see or signal host processes; remount /proc to hide them.
- mount (CLONE_NEWNS): private mount table; pair with pivot_root onto a minimal rootfs for real fs confinement.
- user (CLONE_NEWUSER + Uid/Gid mappings): maps "root-in-namespace" to an unprivileged host uid. This is what makes all of the above usable without real root (and what bubblewrap/rootless-podman rely on). Some distros gate unprivileged userns (kernel.unprivileged_userns_clone).
- ipc / uts: isolate SysV/POSIX IPC and hostname.
Go can unshare all of these at spawn time natively — no external binary — which is the standout "surprisingly easy in Go" finding.
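As a rough illustration of the namespace flags listed above, the sketch below spawns a child in fresh user, net, pid, ipc and uts namespaces using only the standard library. The function names are hypothetical, and the spawn is expected to fail on hosts where unprivileged user namespaces are disabled.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// isolatedCommand returns a command that starts in new user, net, pid, ipc
// and uts namespaces. Uid/gid 0 inside the user namespace map to the
// caller's own unprivileged uid/gid on the host.
func isolatedCommand(program string, args ...string) *exec.Cmd {
	cmd := exec.Command(program, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWNET |
			syscall.CLONE_NEWPID | syscall.CLONE_NEWIPC | syscall.CLONE_NEWUTS,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
	}
	return cmd
}

// networkIsolated reports whether cmd is configured with a new net namespace.
func networkIsolated(cmd *exec.Cmd) bool {
	return cmd.SysProcAttr != nil && cmd.SysProcAttr.Cloneflags&syscall.CLONE_NEWNET != 0
}

func main() {
	// Listing interfaces from inside the child should show no host NICs.
	cmd := isolatedCommand("/bin/cat", "/proc/net/dev")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		// Likely cause: unprivileged user namespaces are disabled on this host.
		fmt.Println("namespace spawn failed:", err)
	}
}
```

A per-role network opt-in would simply leave CLONE_NEWNET out of the flag set for that role.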
cgroups v2 (the real quota layer)
Unified hierarchy. The controllers that matter: cpu.max (CPU quota), memory.max + memory.high (real RSS cap with OOM-kill — the correct memory limit, unlike RLIMIT_AS), pids.max (process count — the correct anti-fork-bomb), io.max (disk throughput/IOPS). Needs either root or a systemd-delegated cgroup subtree.
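The "write the files directly" route can be sketched as below, without containerd/cgroups. The quota struct, the helper names and the values are assumptions; the target directory is assumed to be an already-created, delegated cgroup v2 directory with the memory, pids and cpu controllers enabled. In main a temp dir stands in for it so the sketch runs anywhere.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// quota holds the cgroup v2 limits for one run. Values are illustrative.
type quota struct {
	memoryMax int64 // bytes, written to memory.max
	pidsMax   int   // written to pids.max
	cpuQuota  int   // microseconds of CPU per period, first field of cpu.max
	cpuPeriod int   // period in microseconds, second field of cpu.max
}

// applyQuota writes the limits into dir and then moves pid into the cgroup.
func applyQuota(dir string, q quota, pid int) error {
	files := []struct{ name, value string }{
		{"memory.max", strconv.FormatInt(q.memoryMax, 10)},
		{"pids.max", strconv.Itoa(q.pidsMax)},
		{"cpu.max", fmt.Sprintf("%d %d", q.cpuQuota, q.cpuPeriod)},
		{"cgroup.procs", strconv.Itoa(pid)}, // last, so limits exist before the process joins
	}
	for _, f := range files {
		if err := os.WriteFile(filepath.Join(dir, f.name), []byte(f.value), 0o644); err != nil {
			return fmt.Errorf("write %s: %w", f.name, err)
		}
	}
	return nil
}

// mustTempDir creates a scratch directory that stands in for a real cgroup.
func mustTempDir() string {
	dir, err := os.MkdirTemp("", "fake-cgroup-")
	if err != nil {
		panic(err)
	}
	return dir
}

// readLimit returns the content of one control file, or "" if unreadable.
func readLimit(dir, name string) string {
	b, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	dir := mustTempDir()
	defer os.RemoveAll(dir)
	q := quota{memoryMax: 256 << 20, pidsMax: 32, cpuQuota: 50000, cpuPeriod: 100000}
	if err := applyQuota(dir, q, os.Getpid()); err != nil {
		panic(err)
	}
	fmt.Println("cpu.max =", readLimit(dir, "cpu.max"))
}
```

Writing the pid after Start() leaves a short window in which the child runs unconfined. Newer Go releases also expose SysProcAttr.UseCgroupFD and CgroupFD, which should let the child start inside the cgroup; verify that against the toolchain in use.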
seccomp-bpf (syscall attack-surface reduction)
Allowlist/denylist of syscalls; a denied call gets SIGSYS/EPERM. Blocks the dangerous escape primitives (mount, ptrace, keyctl, bpf, clone of new user ns, kexec, etc.). elastic/go-seccomp-bpf is pure Go (no libseccomp), calls PR_SET_NO_NEW_PRIVS for you, and uses SECCOMP_FILTER_FLAG_TSYNC to sync across the Go runtime's threads. Footgun: installing a filter in the daemon filters the whole Go runtime — apply it only in the thin child just before exec.
Landlock LSM (unprivileged filesystem confinement)
The most attractive Linux-native primitive for this use case: no root, stackable, no unmount needed. go-landlock with BestEffort() degrades gracefully on older kernels.
- Kernel 5.13+ = filesystem rules; 6.7+ (ABI v4) = TCP bind/connect; later ABIs add ioctl-on-device, IPC scoping, audit, UNIX-socket path rules.
- Limits: can restrict open/read/write/exec/create/remove/rename/truncate; cannot restrict chdir, stat, chmod/chown, setxattr, fcntl, access. Network is TCP only (no UDP).
- Go gotcha: Landlock only covers "classic" TCP, not Multipath TCP, and since Go 1.24 net.Listen() defaults to MPTCP, so a Go listener can't be restricted by Landlock unless you disable MPTCP.
Minimal usage:
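    // imports: "log", "github.com/landlock-lsm/go-landlock/landlock"
    err := landlock.V4.BestEffort().RestrictPaths(
        landlock.RODirs("/usr", "/bin"), // read-only system dirs
        landlock.RWDirs(workdir),        // the per-run temp workdir is the only writable path
    )
    if err != nil {
        log.Fatalf("landlock: %v", err)
    }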
4. Higher-level sandboxes
| Sandbox | Isolation strength | Enforces | Go integration | Overhead / cold-start | Root / setup |
|---|---|---|---|---|---|
| bubblewrap (bwrap) | Medium (ns + caller-supplied seccomp) | namespaces, bind-mount rootfs, seccomp (you provide BPF); no cgroups/rlimits built in | Spawn the binary | Near-zero (just a process), ms start | Unprivileged (user ns); used by Flatpak |
| nsjail (Google) | Medium-high (purpose-built for untrusted code) | namespaces + cgroups + rlimits + seccomp (Kafel) in one tool | Spawn the binary (config or flags) | Near-zero, ms start | Unprivileged or privileged modes; single binary |
| firejail | Medium, but setuid-root = added attack surface (LPE CVE history) | namespaces, seccomp, AppArmor; large desktop profile library | Spawn the binary | Low | setuid root; not recommended for server untrusted-code |
| runc / podman | Low-medium (shared host kernel) | cgroups + namespaces + seccomp + caps; OCI | Spawn / OCI libs; podman rootless | ~100s of ms; image mgmt | Rootless possible; a kernel 0-day escapes |
| gVisor (runsc) | High: user-space kernel; app syscalls never reach host kernel | full ns/cgroup quotas + intercepted syscalls (Systrap, seccomp SIGSYS, since 2023) | runsc binary / OCI RuntimeClass (written in Go, not a library) | ~0 CPU; I/O & syscall-heavy 10–30%, up to ~125% on heavy DB; cold-start lower than a microVM (tens–hundreds ms) | Install runtime; some syscall-compat gaps |
| Firecracker microVM | Very high: true KVM VM, own guest kernel | everything (VM); host shielded by KVM | firecracker-go-sdk drives the VMM over its API socket | ~125 ms boot, <5 MiB overhead | Needs /dev/kvm (bare-metal or nested virt) + kernel/rootfs images + TAP networking |
| Kata Containers | Very high (VM, container UX) | everything (VM) via QEMU/Cloud Hypervisor | OCI RuntimeClass (drop-in) | ~150–300 ms start, ~17% runtime overhead | Needs KVM |
| WASM (wazero) | High for pure compute: capability-based, in-process, memory-safe | no ambient authority: no fs/net/env unless explicitly granted; mem cap via RuntimeConfig/ModuleConfig; CPU via context deadline | Pure-Go library, no cgo, no external process | Lowest (in-process), fully portable | No root, no extra process, but role code must compile to Wasm (WASI p1) |
Notes:
- bubblewrap vs nsjail: bwrap is a minimal namespace+mount builder (you add cgroups/rlimits separately, e.g. via systemd-run); nsjail bundles cgroups + rlimits + seccomp, which is exactly the untrusted-code-execution shape the fleet needs. For a single-tool Linux answer, nsjail is the better fit; bubblewrap if you want the smallest, most-audited primitive.
- gVisor is the sweet spot when you want strong isolation without managing VM images/KVM: it presents as a normal OCI runtime, so it slots under Docker/containerd/k8s with a RuntimeClass. The cost is I/O latency and occasional syscall-compat gaps.
- microVMs (Firecracker/Kata) are the industry standard for multi-tenant untrusted code (Lambda, Fargate, E2B, Vercel Sandbox), but bring real operational weight: guest kernel + rootfs images, warm pools, and TAP/CNI networking (which can dominate the 125 ms boot at scale).
5. Tiered recommendation for the fleet
Tier 1 — universal baseline to add now (cheap, ~portable)
On top of what's shipped:
- rlimits on the child via a prlimit(1) wrapper or a re-exec helper: RLIMIT_CPU (backstop the timeout), RLIMIT_FSIZE, RLIMIT_NOFILE, RLIMIT_NPROC, and RLIMIT_AS as a coarse memory guard. Unix-only; gate behind GOOS.
- Network off by default (Linux): SysProcAttr.Cloneflags |= CLONE_NEWNET (empty net ns). Pair with CLONE_NEWUSER + uid/gid mappings so it works without root. Make network an explicit per-role opt-in.
- Drop to a dedicated unprivileged uid (SysProcAttr.Credential) if the daemon runs as root, so the child can't read the executor's secrets/files.
- Set PR_SET_NO_NEW_PRIVS in the child (prerequisite for Tier 2 seccomp and good hygiene regardless).
Cheap, mostly Go-native; the only non-portable parts (rlimits, net ns) are cleanly GOOS-gated.
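For the PR_SET_NO_NEW_PRIVS step, a minimal standard-library sketch follows. The function names are hypothetical; the two prctl option numbers come from linux/prctl.h. Because the attribute is per-thread, the sketch pins the goroutine to its OS thread, which is what a launcher would need to do before calling syscall.Exec on that same thread.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// prctl option numbers from <linux/prctl.h>.
const (
	prSetNoNewPrivs = 38
	prGetNoNewPrivs = 39
)

// setNoNewPrivs sets no_new_privs on the calling OS thread. The flag cannot
// be cleared again and is kept across execve.
func setNoNewPrivs() error {
	_, _, errno := syscall.RawSyscall6(syscall.SYS_PRCTL, prSetNoNewPrivs, 1, 0, 0, 0, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

// noNewPrivs reports whether the calling OS thread has no_new_privs set.
func noNewPrivs() (bool, error) {
	r, _, errno := syscall.RawSyscall6(syscall.SYS_PRCTL, prGetNoNewPrivs, 0, 0, 0, 0, 0)
	if errno != 0 {
		return false, errno
	}
	return r == 1, nil
}

// lockAndSet pins the goroutine to its thread, sets the flag and reads it back.
// The thread is deliberately never unlocked: in a launcher the next step
// would be syscall.Exec from this same thread.
func lockAndSet() (bool, error) {
	runtime.LockOSThread()
	if err := setNoNewPrivs(); err != nil {
		return false, err
	}
	return noNewPrivs()
}

func main() {
	ok, err := lockAndSet()
	fmt.Println("no_new_privs:", ok, err)
}
```

This belongs in the short-lived launcher child only; setting it in the daemon would be harmless for most workloads but is irreversible for the affected thread.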
Tier 2 — Linux hardening for the common self-hosted-Linux deploy
Pick one of:
- Wrap the program in nsjail (recommended) or bubblewrap. nsjail gives namespaces + cgroups v2 (real memory/PID/CPU caps) + rlimits + seccomp from a single binary built for running untrusted code. Lowest engineering cost, no daemon, unprivileged-capable.
- Hand-roll in a thin re-exec child launcher: SysProcAttr namespaces (pid/net/mount/user/ipc/uts) + containerd/cgroups/v2 (memory.max/pids.max/cpu.max) + elastic/go-seccomp-bpf (syscall allowlist) + go-landlock (fs confinement + workdir-only writes). Apply all of it in the child after fork, before exec, never in the daemon.
Either path delivers: real RAM/PID/CPU caps (cgroups), no network unless granted (net ns), filesystem confined to the temp workdir + read-only system dirs (Landlock/mount-ns), and a syscall allowlist (seccomp).
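For the bubblewrap variant of the wrapper option, the daemon mostly needs to build an argv. The sketch below is an assumption about how that could look: bwrapArgv is a hypothetical helper, the read-only directory list is a guess that depends on the distro layout, and cgroup or rlimit caps still have to come from elsewhere because bwrap does not provide them.

```go
package main

import "fmt"

// bwrapArgv builds a bubblewrap command line that runs program with every
// supported namespace unshared (including net), the listed system dirs
// mounted read-only, and workdir as the only writable path.
func bwrapArgv(workdir string, roDirs []string, program string, args ...string) []string {
	argv := []string{"bwrap", "--unshare-all", "--die-with-parent", "--new-session"}
	for _, d := range roDirs {
		argv = append(argv, "--ro-bind", d, d)
	}
	argv = append(argv,
		"--proc", "/proc", // fresh procfs matching the new pid namespace
		"--dev", "/dev", // minimal device nodes
		"--bind", workdir, workdir,
		"--chdir", workdir,
		program, // bwrap treats the first non-option argument as the command
	)
	return append(argv, args...)
}

func main() {
	argv := bwrapArgv(
		"/tmp/fleet-run-1", // hypothetical per-run workdir
		[]string{"/usr", "/bin", "/lib", "/lib64"},
		"/usr/bin/python3", "main.py",
	)
	fmt.Println(argv)
}
```

The resulting slice would be passed to exec.CommandContext as argv[0] and argv[1:], with the existing env scrubbing, timeout and stdout cap left in place.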
Tier 3 — strongest, when role-authors may be untrusted
- gVisor (runsc): best strength/effort ratio: user-space kernel, no VM images, OCI drop-in. Accept the I/O overhead and occasional syscall-compat gaps.
- Firecracker or Kata microVM: when you need true kernel-level isolation / hostile multi-tenant. Requires /dev/kvm and image+network+warm-pool plumbing. Drive Firecracker from Go via firecracker-go-sdk; Kata via OCI RuntimeClass.
- wazero (Wasm): if (and only if) the role's code can be compiled to Wasm: strongest and cheapest, fully portable, in-process, no root. Doesn't fit "run arbitrary operator-allowlisted native programs," so it's a parallel track for pure-compute roles, not a replacement for the native-program path.
Cheap+portable vs needs-Linux+privileges, at a glance
- Cheap + portable: scrubbed env, timeout, temp cwd, output cap (done); rlimits (GOOS-gated); unprivileged uid; wazero (if Wasm).
- Linux, no root: user+net+mount+pid namespaces, Landlock, seccomp, bubblewrap/nsjail (with unprivileged userns), rootless podman.
- Linux + privileges/setup: cgroups v2 (delegation), gVisor (runtime install), Firecracker/Kata (/dev/kvm + images).
6. Two Go-specific findings worth remembering
- Surprisingly easy in Go: Linux namespaces are first-class via syscall.SysProcAttr, giving pid/net/mount/user isolation with zero external binaries. wazero gives a genuine capability sandbox as a pure-Go library (no cgo, no extra process). go-landlock is a clean, unprivileged fs/network confinement library with graceful kernel fallback.
- Surprisingly hard / footguns in Go: seccomp and tight rlimits don't compose with the Go parent. You cannot safely install a seccomp filter or a hard RLIMIT in the long-running, multithreaded Go daemon, and SysProcAttr has no seccomp/rlimit fields; the only correct place is a thin child launcher (/proc/self/exe re-exec, or nsjail/bwrap) that sets NO_NEW_PRIVS + seccomp + Landlock + rlimits and then execs the allowlisted program. Also: RLIMIT_AS ≠ real memory (use cgroup memory.max), RLIMIT_NPROC is per-uid global (use the pids cgroup), and Landlock can't restrict net.Listen() because Go 1.24 defaults to MPTCP.
Sources
- gVisor — Performance Guide, Releasing Systrap (2023-04-28), Optimizing seccomp usage (2024-02-01), Platforms
- Kata vs gVisor vs Firecracker — Northflank: Kata vs gVisor, Northflank: Kata vs Firecracker vs gVisor, KubeBlocks runC/Kata/gVisor DB benchmark, onidel (2025)
- Firecracker — Northflank: What is AWS Firecracker, cloudrps: Firecracker microVMs explained, E2B: Firecracker vs QEMU
- AI agent sandboxing landscape (2026) — manveerc: How to sandbox AI agents in 2026, particula: SmolVM vs Firecracker vs Docker
- Landlock — Kernel: Landlock LSM, Kernel: unprivileged access control, landlock.io, go-landlock pkg.go.dev, Sandboxing network tools with Landlock (2025-12-06), LWN: Toward unprivileged sandboxing (2017)
- bubblewrap / firejail — containers/bubblewrap, netblue30/firejail, Firejail blog: Sandbox Linux apps (2025-08-20), ArchWiki: Bubblewrap
- nsjail — google/nsjail, nsjail.dev
- cgroups v2 — containerd/cgroups/v2 pkg.go.dev, Kernel: Control Group v2
- seccomp in Go — elastic/go-seccomp-bpf, seccomp/libseccomp-golang, Kernel: seccomp_filter
- Go rlimits / exec — Go src: syscall/rlimit.go, golang/go #30401 (macOS Setrlimit), golang/go #66797 / #67184 (NOFILE cache), Go src: syscall/exec_linux.go
- wazero / Wasm — wazero docs, Wazero hardening for Go embedders, eunomia: WASI & Component Model status (2025-02-16)
Reference compiled 2026-06-30. Verify kernel/library versions against the actual deploy box before relying on a given ABI/feature.