Federation hot cache — spec
Status: deferred — not worth building yet (no concrete need). The realistic case is already covered
more cheaply by hand: an owner who knows their node is going down can hand a trusted peer a replica in
advance. The automated, continuously-syncing pull machinery below only earns its keep for unplanned or
permanent disappearance, or for hands-off freshness — none of which is pressing now. Keep this as a
design reference; revisit when a live federated peer or the agent runtime (agent.md) actually needs
continuous, hands-off caching over a neighbor KB. The design below (admin-only, isolated cache tables)
stands — it is simply not scheduled.
TL;DR. A continuously-updated, admin-only local copy ("hot cache") of a neighbor's shared federated
subgraph, kept in its own cache tables + a dedicated vector index — never in your note-store. Two
purposes: don't fear losing the source (a live local backup that auto-refreshes while the neighbor is up),
and let the admin search / vector-retrieve a neighbor's knowledge locally — including when the neighbor is
down or gone for good. Single-writer: the owner writes; peers hold a read copy.
Hard constraint — never re-transmit a neighbor's private data. The cache exists for the admin's own
consumption only. Consequences that shape the whole design:
- No public serving of a neighbor's content, ever.
- No permission-role machinery. Admin-only ⇒ there is nobody to regulate access for, so the entire
cross-instance access-policy problem (mirroring the origin's paywall, who is a subscriber, etc.)
disappears.
- No forwarding of private content to a third peer. Transitive propagation (phase 2) is allowed only
for content the origin explicitly marked public.
Storage decision — separate cache tables + a dedicated vector index (NOT materialized into note_paths).
Because "private foreign data must never leak" is a hard invariant, physical isolation beats flag
discipline: if the replica lives in its own replica_* tables, the publishing / serving / sitemap / git
mirror / webhook / Telegram / public-site-search pipelines structurally cannot see it — there is nothing
to suppress and nothing to forget. The only cost is that the cache's one consumer (admin search + vector)
must read the cache store explicitly; at admin-only scope that is far cheaper and safer than gating five
outbound pipelines on a flag. (This reverses an earlier "materialize into the note-store" idea, which was
attractive for "free" indexing but required suppressing every outbound handler and carried real leak risk.)
Why — and why not just a whole-DB replica
The value is the admin's federated search + a local vector base over a neighbor's subgraph, plus a
self-refreshing backup so you don't lose the source:
- A live proxy can't be embedded. Today federated_search proxies live to the neighbor's origin. So
when the neighbor is down, search over it breaks; and you cannot vector-index its content: you can
only embed text you hold locally.
- A whole-DB replica (LiteFS / Litestream) doesn't solve this. Those copy your own whole DB for your
own nodes. They don't pull a neighbor's partial subgraph across a trust boundary, nor give you a local,
embeddable copy of just the part a peer shared.
- So the hot cache is a local materialization of a neighbor's subgraph, but kept isolated. Once it's
local you can embed it, search it, and you still have it if the neighbor vanishes. Surviving the
neighbor's outage or permanent disappearance is a consequence of holding a local copy.
This is the read half of a single-writer knowledge network on cheap/disposable peers: the knowledge survives
any one node because trusted peers hold copies (see ../en/thoughts/surviving-node-loss.md). It pairs with
the agent runtime (agent.md): an admin's fleet sidecar can retrieve over a cached neighbor KB.
What already exists (reuse)
- Federation trust + transport. HMAC-SHA256 JWT (alg=HS256, kid, exp=iat+30s); secrets in
federation_secrets(kid, secret_crypt, kb_url, revoked_at) + per-subgraph scope in
federation_secret_subgraphs. Outbound: internal/federation/client.go. Inbound:
internal/case/mcp/federation_helpers.go:verifyInbound() → (kid, allowed_subgraphs); entry via
internal/case/mcp/endpoint.go. federated_search proxies live today; the hub stores no remote data.
- Versioning cursor. note_versions.id is a global autoincrement, a natural high-watermark cursor
(migration 20250515071315_add_id_to_note_versions.sql). note_paths.latest_content_hash;
soft-delete via note_paths.hidden_by / hidden_at.
- Subgraph membership in frontmatter (subgraph / subgraphs → NoteView.SubgraphNames,
internal/model/note.go).
- Embedding mechanics to mirror (not reuse directly). Chunking via mdchunk.Split, OpenAI passage
embeddings, brute-force dot-product scoring (internal/case/sitesearch/resolve.go,
internal/case/backjob/generatenoteversionembedding/resolve.go). The cache builds a parallel vector
index over its own tables using the same primitives.
So trigger, auth, scope, the cursor, and the embedding primitives exist. Missing for v1: a pull-diff
endpoint, the cache store + its own sync loop and vector index, an admin search over it, and a pull grant.
v1 design (admin-only, isolated cache)
Grant
The origin grants a peer replica-pull rights via a new can_pull_replica flag on its federation_secrets
row for that peer. Scope stays gated by federation_secret_subgraphs.
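
The grant check can be sketched as a small pure function. This is only an illustration: the secretRow type, its field names, and authorizePull are hypothetical stand-ins for a federation_secrets row joined with its federation_secret_subgraphs scope, and the status codes follow the 401/403 split this spec uses for the endpoint and the test plan.

```go
package main

import (
	"fmt"
	"net/http"
)

// secretRow is a hypothetical in-memory view of one federation_secrets row
// together with its federation_secret_subgraphs scope.
type secretRow struct {
	Kid              string
	Revoked          bool            // revoked_at is set
	CanPullReplica   bool            // the new can_pull_replica flag
	AllowedSubgraphs map[string]bool // per-subgraph scope
}

// authorizePull returns the HTTP status a pull request for subgraph should get.
func authorizePull(row secretRow, subgraph string) int {
	switch {
	case row.Revoked:
		return http.StatusUnauthorized // revoked kid: 401
	case !row.CanPullReplica:
		return http.StatusForbidden // no pull grant: 403
	case !row.AllowedSubgraphs[subgraph]:
		return http.StatusForbidden // subgraph out of scope: 403
	default:
		return http.StatusOK
	}
}

func main() {
	peer := secretRow{
		Kid:              "peer-a",
		CanPullReplica:   true,
		AllowedSubgraphs: map[string]bool{"research": true},
	}
	fmt.Println(authorizePull(peer, "research")) // 200
	fmt.Println(authorizePull(peer, "private"))  // 403
	peer.Revoked = true
	fmt.Println(authorizePull(peer, "research")) // 401
}
```

Checking revocation first means a revoked peer learns nothing about which subgraphs exist or which grants it used to hold.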
Origin — pull-diff endpoint
POST /api/federation/v1/pull_diff { subgraph, since_version_id }, authed by the existing HMAC federation
JWT. New Endpoint under internal/case/federation/pull_diff/, registered in
internal/router/endpoints_gen.go. It:
- verifyInbound() → (kid, allowed_subgraphs); reject if subgraph ∉ scope or can_pull_replica unset
(403); revoked kid → 401.
- Queries note_versions where id > since_version_id for the subgraph's paths + hidden paths since the
cursor (tombstones). Membership lives in frontmatter, so maintain a derived
federation_note_subgraphs(path_id, subgraph_id) table (updated by the frontmatter-change hook) for an
O(1) join instead of re-parsing notes per request.
- Returns upserts + deleted_paths + a signed envelope:
  {
    "subgraph": "...",
    "since_version_id": N,
    "last_version_id": M,
    "has_more": false,
    "upserts": [{ "path", "version_id", "content", "content_hash", "updated_at" }],
    "deleted_paths": ["..."],
    "root_sig": { "type": "hmac", "value": "..." }
  }

since_version_id: 0 = full pull. has_more paginates. Origin lost the cursor (vacuum) →
{ "full_resync_required": true } → puller resets to 0. Sign reuses hmacSHA256() +
subtle.ConstantTimeCompare.
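
A minimal sketch of signing and checking the envelope with the standard library. The spec does not say which bytes root_sig covers, so the serialization in rootSig (quoted fields, one line per upsert and tombstone) is an assumption, as are the diff and upsert types, the SHA-256 content hash, and the function names. The repo's own hmacSHA256() helper is not reproduced here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

type upsert struct {
	Path        string
	VersionID   int64
	Content     string
	ContentHash string
}

type diff struct {
	Subgraph       string
	SinceVersionID int64
	LastVersionID  int64
	Upserts        []upsert
	DeletedPaths   []string
	RootSigType    string
	RootSigValue   string
}

// contentHash is the hex SHA-256 of a note body (the hash choice is an assumption).
func contentHash(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// rootSig MACs an assumed canonical serialization of the diff.
// %q quotes strings so field boundaries cannot be forged by odd path names.
func rootSig(secret []byte, d diff) string {
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%q|%d|%d\n", d.Subgraph, d.SinceVersionID, d.LastVersionID)
	for _, u := range d.Upserts {
		fmt.Fprintf(mac, "U|%q|%d|%s\n", u.Path, u.VersionID, u.ContentHash)
	}
	for _, p := range d.DeletedPaths {
		fmt.Fprintf(mac, "D|%q\n", p)
	}
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes every content hash and the MAC, comparing in constant time.
func verify(secret []byte, d diff) bool {
	if d.RootSigType != "hmac" {
		return false
	}
	for _, u := range d.Upserts {
		if contentHash(u.Content) != u.ContentHash {
			return false
		}
	}
	want := rootSig(secret, d)
	return subtle.ConstantTimeCompare([]byte(want), []byte(d.RootSigValue)) == 1
}

func main() {
	secret := []byte("shared-federation-secret")
	d := diff{
		Subgraph:      "research",
		LastVersionID: 2,
		Upserts: []upsert{
			{Path: "a.md", VersionID: 1, Content: "alpha", ContentHash: contentHash("alpha")},
			{Path: "b.md", VersionID: 2, Content: "beta", ContentHash: contentHash("beta")},
		},
		DeletedPaths: []string{"old.md"},
		RootSigType:  "hmac",
	}
	d.RootSigValue = rootSig(secret, d)
	fmt.Println("intact:", verify(secret, d)) // intact: true

	d.Upserts[0].Content = "alphb" // flip one byte in transit
	fmt.Println("tampered:", verify(secret, d)) // tampered: false
}
```

Signing the content hashes rather than the bodies keeps the MAC input small; the puller still detects a changed body because it recomputes each hash before trusting the signature.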
Puller — cache store + sync loop
Separate tables, owned entirely by the puller:
- subgraph_replica_cursors(origin_kid, subgraph, last_version_id, sync_interval_seconds, last_synced_at):
one row per cached subgraph; default interval 5 min, pull-poll (self-healing).
- replica_notes(origin_kid, subgraph, path, version_id, content, content_hash, updated_at),
PK (origin_kid, subgraph, path). The origin_kid prefix makes two peers (or a peer + a local note of
the same path) collision-free by construction.
- replica_deleted_paths(origin_kid, subgraph, path, hidden_at): tombstones.
- replica_note_chunks(origin_kid, subgraph, path, chunk_idx, content, content_hash, embedding): the
dedicated cache vector index, mirroring note_version_chunks.
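
To illustrate why the composite key is collision-free, here is a small in-memory sketch. The replicaKey, replicaNote, and cacheStore types are hypothetical stand-ins for the replica_notes primary key and row, and a plain map stands in for the local note store.

```go
package main

import "fmt"

// replicaKey mirrors the replica_notes primary key (origin_kid, subgraph, path).
type replicaKey struct {
	OriginKid string
	Subgraph  string
	Path      string
}

type replicaNote struct {
	VersionID int64
	Content   string
}

// cacheStore is an in-memory stand-in for the replica_notes table.
type cacheStore map[replicaKey]replicaNote

func (s cacheStore) upsert(k replicaKey, n replicaNote) { s[k] = n }

func main() {
	// Stand-in for the local note store, which is keyed by path alone.
	local := map[string]string{"notes/intro.md": "my own intro"}

	replica := cacheStore{}
	replica.upsert(replicaKey{"peer-a", "research", "notes/intro.md"}, replicaNote{7, "A's intro"})
	replica.upsert(replicaKey{"peer-b", "research", "notes/intro.md"}, replicaNote{3, "B's intro"})

	// Same path three times, three distinct rows: nothing was overwritten.
	fmt.Println(len(replica), len(local)) // 2 1
	fmt.Println(local["notes/intro.md"])  // my own intro
}
```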
Each tick: pull → recompute and constant-time-compare root_sig (reject tampered diffs) → upsert
replica_notes, apply tombstones → re-chunk + re-embed changed paths into replica_note_chunks
immediately (don't wait for any cron) → advance the cursor only after a successful write.
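
The ordering inside one tick can be sketched with an in-memory cache. Everything here is illustrative: the cache type, tick, and the embed hooks are hypothetical, signature checking is reduced to a boolean passed in by the caller, and staging into copies stands in for what would presumably be a database transaction.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errTampered  = errors.New("root_sig mismatch: diff rejected")
	errEmbedDown = errors.New("embedding backend unavailable")
)

type upsert struct{ Path, Content string }

type diff struct {
	LastVersionID int64
	Upserts       []upsert
	DeletedPaths  []string
}

// cache is an in-memory stand-in for replica_notes, replica_note_chunks and the cursor row.
type cache struct {
	cursor int64
	notes  map[string]string
	vecs   map[string][]float32
}

func newCache() *cache {
	return &cache{notes: map[string]string{}, vecs: map[string][]float32{}}
}

// embedFunc is a placeholder for chunking plus the embedding call.
type embedFunc func(content string) ([]float32, error)

func fakeEmbed(content string) ([]float32, error) { return []float32{float32(len(content))}, nil }
func brokenEmbed(string) ([]float32, error)       { return nil, errEmbedDown }

// tick applies one pulled diff. Changes are staged in copies and committed
// together, so the cursor moves only after every write succeeded.
func (c *cache) tick(d diff, sigOK bool, embed embedFunc) error {
	if !sigOK {
		return errTampered
	}
	notes := make(map[string]string, len(c.notes))
	vecs := make(map[string][]float32, len(c.vecs))
	for k, v := range c.notes {
		notes[k] = v
	}
	for k, v := range c.vecs {
		vecs[k] = v
	}
	for _, u := range d.Upserts {
		vec, err := embed(u.Content) // re-embed changed paths right away
		if err != nil {
			return err // nothing committed, cursor unchanged
		}
		notes[u.Path], vecs[u.Path] = u.Content, vec
	}
	for _, p := range d.DeletedPaths { // tombstones
		delete(notes, p)
		delete(vecs, p)
	}
	c.notes, c.vecs, c.cursor = notes, vecs, d.LastVersionID
	return nil
}

func main() {
	c := newCache()
	d := diff{LastVersionID: 2, Upserts: []upsert{{"a.md", "alpha"}, {"b.md", "beta"}}}

	fmt.Println(c.tick(d, true, brokenEmbed), c.cursor) // embedding backend unavailable 0
	fmt.Println(c.tick(d, true, fakeEmbed), c.cursor)   // <nil> 2

	gone := diff{LastVersionID: 3, DeletedPaths: []string{"a.md"}}
	fmt.Println(c.tick(gone, true, fakeEmbed), len(c.notes), c.cursor) // <nil> 1 3
}
```

Because a failed tick leaves the cursor where it was, the next poll simply asks for the same range again, which is the self-healing property the pull-poll design relies on.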
Admin consumer surface
An admin-scoped federated search reads the cache: vector dot-product over replica_note_chunks (+
optional keyword over replica_notes). The existing federated_search is repointed to prefer the local
cache (fast, survives origin-down) and may fall back to a live proxy for non-cached subgraphs. No new
public surface; no roles.
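
A brute-force dot-product search over cached chunks might look like the following. This is a sketch under assumptions: the chunk type mirrors the replica_note_chunks columns, the embeddings are toy three-dimensional vectors, and search is a hypothetical name, not the existing site-search code.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk mirrors one replica_note_chunks row.
type chunk struct {
	OriginKid string
	Subgraph  string
	Path      string
	ChunkIdx  int
	Content   string
	Embedding []float32
}

type hit struct {
	Chunk chunk
	Score float32
}

// dot returns the dot product over the shared prefix of a and b.
func dot(a, b []float32) float32 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var s float32
	for i := 0; i < n; i++ {
		s += a[i] * b[i]
	}
	return s
}

// search scores every cached chunk against the query vector and keeps the top k.
func search(chunks []chunk, query []float32, k int) []hit {
	if k < 0 {
		k = 0
	}
	hits := make([]hit, 0, len(chunks))
	for _, c := range chunks {
		hits = append(hits, hit{Chunk: c, Score: dot(c.Embedding, query)})
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k < len(hits) {
		hits = hits[:k]
	}
	return hits
}

func main() {
	chunks := []chunk{
		{OriginKid: "peer-a", Subgraph: "research", Path: "a.md", Embedding: []float32{0.9, 0.1, 0}},
		{OriginKid: "peer-a", Subgraph: "research", Path: "b.md", Embedding: []float32{0.1, 0.9, 0}},
		{OriginKid: "peer-b", Subgraph: "research", Path: "c.md", Embedding: []float32{0.7, 0.7, 0}},
	}
	query := []float32{1, 0, 0} // in practice: the embedding of the admin's query text
	for _, h := range search(chunks, query, 2) {
		fmt.Println(h.Chunk.OriginKid, h.Chunk.Path) // a.md first, then c.md
	}
}
```

The scan is linear in the number of cached chunks, which is the same trade-off the existing brute-force scoring makes and is likely fine at admin-only scale.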
Isolation by construction (what we no longer need)
Because the cache is not in note_paths/note_versions, none of these can ever see it — so there is nothing
to gate: git mirror (internal/gitapi/materialize.go), change_webhooks / agents, Telegram publish, sitemap,
custom-domain routing, and the public site search / vector (Bleve from AllLatestNotes,
LatestNoteChunks). The neighbor's content cannot leak through any of them. This is the structural payoff of
separate tables.
Out of v1 — hard limits (not just deferred)
- Public serving of a neighbor's content — forbidden by the never-re-transmit invariant.
- Cross-instance permission roles — unnecessary at admin-only scope; do not build them.
- Forwarding private content to third peers — forbidden. Only origin-marked-public content may ever
propagate further (phase 2).
Phase 2 — and forward-compat seams to land in v1
- root_sig is a versioned envelope from day one ({type, value}), so phase 2 can upgrade to a Merkle
attestation ({type:"merkle", merkle_root, signature, proof}): same field, stronger verification.
- Adoption (origin gone for good). Promote a cached subgraph to your own notes: a real migration from
replica_* into note_paths / note_versions (deliberately explicit, not a flag flip, since the stores
are separate). Only legitimate for content you are permitted to keep/own; private foreign data stays a
private admin backup, not a republished vault.
- Transitive (Merkle) verification, public content only. A third peer can verify a cached subgraph
without the origin's shared secret if the origin signed a Merkle root with an asymmetric key
(Ed25519) whose public half, origin_pubkey, is added to federation_secrets: the holder serves a note
onward with an inclusion proof + the signed root; the third peer verifies with the origin's public key,
even if the origin is dead. This is what would turn the cache into a P2P preservation network, but
gated to content the origin marked public, never private subgraphs. Background: Certificate
Transparency (RFC 9162) is the closest production model of a signed Merkle root.
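
The transitive check can be sketched with the standard library's Ed25519 and SHA-256. This is a simplified illustration, not the RFC 9162 tree: it borrows the leaf/node prefix bytes from Certificate Transparency but promotes an unpaired node unchanged, and all names (step, rootAndProof, verifyInclusion, newOrigin, signRoot) are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// step is one sibling on the path from a leaf to the root.
type step struct {
	Sibling [32]byte
	Left    bool // true when the sibling sits to the left of the running hash
}

func leafHash(data []byte) [32]byte { return sha256.Sum256(append([]byte{0}, data...)) }

func nodeHash(l, r [32]byte) [32]byte {
	buf := append([]byte{1}, l[:]...)
	return sha256.Sum256(append(buf, r[:]...))
}

// rootAndProof builds the tree bottom-up and collects the proof for leaf idx.
// leaves must be non-empty; an unpaired node is promoted unchanged.
func rootAndProof(leaves [][32]byte, idx int) ([32]byte, []step) {
	var proof []step
	level := leaves
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i])
				continue
			}
			next = append(next, nodeHash(level[i], level[i+1]))
		}
		if sib := idx ^ 1; sib < len(level) {
			proof = append(proof, step{Sibling: level[sib], Left: sib < idx})
		}
		idx /= 2
		level = next
	}
	return level[0], proof
}

// newOrigin derives a key pair from a 32-byte seed (use crypto/rand in practice).
func newOrigin(seed []byte) (ed25519.PublicKey, ed25519.PrivateKey) {
	priv := ed25519.NewKeyFromSeed(seed)
	return priv.Public().(ed25519.PublicKey), priv
}

func signRoot(priv ed25519.PrivateKey, root [32]byte) []byte { return ed25519.Sign(priv, root[:]) }

// verifyInclusion needs only the origin's public key: no shared secret, no live origin.
func verifyInclusion(pub ed25519.PublicKey, root [32]byte, sig, note []byte, proof []step) bool {
	if !ed25519.Verify(pub, root[:], sig) {
		return false
	}
	h := leafHash(note)
	for _, s := range proof {
		if s.Left {
			h = nodeHash(s.Sibling, h)
		} else {
			h = nodeHash(h, s.Sibling)
		}
	}
	return h == root
}

func main() {
	pub, priv := newOrigin(make([]byte, ed25519.SeedSize)) // fixed demo seed
	notes := [][]byte{[]byte("public note 1"), []byte("public note 2"), []byte("public note 3")}
	leaves := make([][32]byte, len(notes))
	for i, n := range notes {
		leaves[i] = leafHash(n)
	}
	root, proof := rootAndProof(leaves, 1)
	sig := signRoot(priv, root)

	fmt.Println(verifyInclusion(pub, root, sig, notes[1], proof))         // true
	fmt.Println(verifyInclusion(pub, root, sig, []byte("forged"), proof)) // false
}
```

The third peer trusts only origin_pubkey: the holder that served the note cannot alter it without breaking either the inclusion proof or the signature over the root.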
Options compared (transport)
| | A: pull-poll (recommended) | B: MCP pull_diff tool | C: push-on-change webhook |
|---|---|---|---|
| New auth surface | none (reuses HMAC JWT) | none | friend must expose an inbound endpoint |
| Origin complexity | one handler + a few queries | one MCP tool case | delivery queue + retries |
| Freshness | interval (default 5 min) | same | sub-second |
| Self-healing | yes (next poll) | yes | needs retry logic |
| Timeout risk | low (REST) | real (short MCP client timeout) | n/a |
| v1 | yes | bolt-on later | no |
Test plan
- Auth gates: kid without can_pull_replica → 403; flag set but subgraph out of scope → 403; revoked
kid → 401.
- Incremental correctness: write 3 notes, pull from 0 → all 3; write a 4th, pull from prior max → only
the 4th.
- Tombstones: hide a note → it appears in deleted_paths and is removed from replica_notes +
replica_note_chunks, not in upserts.
- Pagination: insert > batch size → has_more: true, second pull returns the rest.
- Integrity rejection: tamper one byte → puller rejects on root_sig mismatch.
- Full-resync: origin returns full_resync_required → puller resets the cursor to 0 and re-pulls.
- Isolation (the core invariant): sync a subgraph → assert it is absent from note_paths, the
public site search, the public vector index, the sitemap, the git mirror, and that no change_webhook /
agent / Telegram publish fired. It must appear only in the admin federated search.
- Vector freshness: sync after boot → the new content is immediately retrievable via the admin cache
vector search (no restart, no cron wait).
- Collision safety: two origins sharing a path, and a local note of the same path, all coexist (keyed
by origin_kid), none overwrites another.
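
The incremental-correctness and pagination cases can be exercised against a tiny in-memory origin. This is a sketch only: the origin, version, and page types and the pullAll loop are hypothetical, tombstones and signatures are left out, and version IDs are just a counter standing in for note_versions.id.

```go
package main

import "fmt"

type version struct {
	ID      int64
	Path    string
	Content string
}

// origin is an in-memory stand-in for note_versions on the origin side.
type origin struct{ versions []version }

func (o *origin) write(path, content string) {
	o.versions = append(o.versions, version{ID: int64(len(o.versions) + 1), Path: path, Content: content})
}

type page struct {
	Upserts       []version
	LastVersionID int64
	HasMore       bool
}

// pullDiff returns up to batch versions with id > since.
func (o *origin) pullDiff(since int64, batch int) page {
	if batch < 1 {
		batch = 1
	}
	p := page{LastVersionID: since}
	for _, v := range o.versions {
		if v.ID <= since {
			continue
		}
		if len(p.Upserts) == batch {
			p.HasMore = true
			break
		}
		p.Upserts = append(p.Upserts, v)
		p.LastVersionID = v.ID
	}
	return p
}

// pullAll follows has_more until the origin is drained and returns the new cursor.
func pullAll(o *origin, since int64, batch int) ([]version, int64) {
	var all []version
	for {
		p := o.pullDiff(since, batch)
		all = append(all, p.Upserts...)
		since = p.LastVersionID
		if !p.HasMore {
			return all, since
		}
	}
}

func main() {
	o := &origin{}
	o.write("a.md", "one")
	o.write("b.md", "two")
	o.write("c.md", "three")

	got, cursor := pullAll(o, 0, 2)  // batch of 2 forces a second page
	fmt.Println(len(got), cursor)    // 3 3

	o.write("d.md", "four")
	got, cursor = pullAll(o, cursor, 2)
	fmt.Println(len(got), got[0].Path, cursor) // 1 d.md 4
}
```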
Composition
Pairs with the agent runtime (agent.md): an admin fleet sidecar retrieves (vector/keyword) over a cached
neighbor KB and keeps working when the origin is down or gone. Adoption (phase 2) is how a surviving peer
keeps permitted knowledge alive permanently; private caches stay private backups.