Token-economy benchmark — toc/section retrieval cost

Status: Shipped to main (PR #21, merge 3ad4c21). See the manifest below; the
original TASK spec and the later findings follow.


Status & manifest — everything in one place (2026-06-21)

Shipped to main via PR #21. Prod (trip2g.com) needs a redeploy from
~/projects/trip2g (pull main) to serve it. The e2e suite runs locally only
(./scripts/test-e2e.sh; the CI job is disabled with if: false). The user-facing docs
below publish via obsidian-sync.

What shipped

  • expand / federated_expand — progressive-disclosure TOC navigation. Returns the
    direct children of a TOC node (title / level / path / has_children); empty toc_path
    = top level. Walk the tree level by level without loading content or searching all chunks.
  • Slim search — dropped the per-result flat toc[]; kept the precise per-match
    toc_path. Structure now comes from expand.
  • stdio MCP adapter — one zero-dependency composite tool over search → expand → note_html, so an agent gets just the relevant section in one call.
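
The expand contract described above can be sketched as a pure function over a note's heading list. This is an illustration only, not the server code under internal/case/mcp: the (level, title) input, the list-of-titles path representation, the use of tree depth as "level", and the names build_paths / direct_children are all assumptions. Only the four returned fields and the "empty toc_path = top level" rule come from the description.

```python
from typing import Dict, List, Sequence, Tuple


def build_paths(headings: Sequence[Tuple[int, str]]) -> List[List[str]]:
    """Turn (level, title) headings in document order into full TOC paths."""
    stack: List[Tuple[int, str]] = []
    paths: List[List[str]] = []
    for level, title in headings:
        # Close every open heading that is at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths


def direct_children(
    headings: Sequence[Tuple[int, str]], toc_path: Sequence[str] = ()
) -> List[Dict[str, object]]:
    """Return only the direct children of toc_path; an empty path means top level."""
    parent = list(toc_path)
    paths = build_paths(headings)
    children: List[Dict[str, object]] = []
    for path in paths:
        if len(path) == len(parent) + 1 and path[:-1] == parent:
            has_children = any(
                len(p) > len(path) and p[: len(path)] == path for p in paths
            )
            children.append(
                {
                    "title": path[-1],
                    "level": len(path),  # assumption: level = depth in the tree
                    "path": path,
                    "has_children": has_children,
                }
            )
    return children
```

Calling direct_children(headings) lists the top-level sections; passing one of the returned path values descends a single level. No section content is touched at any step, which is the point of the tool.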

Measured results (live trip2g.com, pre-deploy)

  • Reading the relevant section costs ~11–14× fewer tokens than the whole note (median,
    real queries). Scales with note size; short notes save little (~3×).
  • Token saving is the easy half. Selecting the right section is the hard half:
    deterministic selection missed 8 of 9 queries, and a fixed grep window lost the
    answer in 3 of 4 cases. That is why expand (navigation) plus an agent/LLM picker matter.

Where everything lives

Piece | Path
--- | ---
expand / federated_expand (server) | internal/case/mcp/{resolve,toc_path,types,federation_handlers}.go, internal/federation/client.go, internal/model/federation.go
Live token-economy verifier (zero-dep) | scripts/token_economy_check.py
Live expand/navigation test (zero-dep) | scripts/expand_check.py
stdio MCP adapter (zero-dep) | docs/en/user/trip2g_mcp_stdio_adapter.py
Adapter setup note (EN/RU) | docs/{en,ru}/user/ai-agent-mcp-adapter.md
Token-economy chart note (EN/RU) | docs/en/user/token-economy-bench.md, docs/ru/user/token-economy-bench.md + *.datachart.csv
This design + findings | docs/dev/token_economy_bench.md (you are here)

Reproduce / use

  • Verify token economy on any instance: python3 scripts/token_economy_check.py
  • Test expand navigation (after deploy): python3 scripts/expand_check.py
  • Give an agent the one-call tool: register trip2g_mcp_stdio_adapter.py as an MCP stdio
    server (config in the adapter note).

Remaining

  1. Redeploy prod from ~/projects/trip2g (pull main) → serves expand + slim search.
  2. Run ./scripts/test-e2e.sh locally (CI e2e is disabled).
  3. obsidian-sync the docs to publish the chart + adapter note.

The sections below are the original TASK spec (kept for history) and the
implementation findings recorded along the way.

Why

docs/en/user/Token Economy.md and docs/en/user/Fuzzy Pointer.md make concrete
token-cost claims that no benchmark backs:

  • "one section is ~300 tokens. The full note is ~3000 tokens" → ~10× saving
    (Token Economy.md:21, Fuzzy Pointer.md:68).
  • loading the whole note "costs 10× more tokens and pushes the actual answer near the
    bottom of a long context window" (Fuzzy Pointer.md:76).
  • GraphQL field selection saves tokens by dropping unused fields
    (Token Economy.md:38-73).

Retrieval quality is already measured — docs/dev/retrieval_eval.md and
docs/en/thoughts/search-benchmark.md (the heading breadcrumb / F4 feature lifts
long-doc nDCG +0.023, en-en +0.092). What's missing is retrieval cost: the actual
token count of the drill-down patterns. This task fills that gap and replaces the
hand-waved "~300 / ~3000" numbers in the user docs with measured ones.

Keep it standalone. This is a single-instance cost benchmark. Do not mix it
with federation or the whoami mesh (docs/dev/whoami_test.md) — federated_note_html
takes match_id, not toc_path (internal/model/federation.go:54), so the federated
cost variant is a separate follow-up.

What to measure

A corpus of large multi-section notes + a query set with a known answer-section per
query. For each query, measure the tokens an agent would consume to reach the answer
under each retrieval pattern (all read the same hit):

Arm | Calls | Tokens counted
--- | --- | ---
full (anti-pattern) | search → note_html(pid) | search summary + whole note
toc_path (fuzzy pointer) | search → note_html(pid, toc_path=section) | summary + one section
match_id (focused window) | search → note_html(pid, match_id) | summary + chunk window
gql_all | GraphQL note query, all fields | body+frontmatter+raw_markdown+timestamps+tags
gql_min | GraphQL note query, minimal fields | e.g. highlightedTitle+url only

Per arm: token count (use a fixed tokenizer — e.g. tiktoken cl100k; record which,
since absolute numbers depend on it, ratios less so). Headline outputs:
ratio_section = section/full, ratio_window = window/full, ratio_gql = gql_min/gql_all.
Primary goal: confirm or correct the ~300/~3000 ≈ 10× claim.
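
The three headline ratios follow directly from the per-arm token counts. A minimal sketch, assuming the counts have already been produced by whichever fixed tokenizer the run records; the function name arm_ratios and the dict keys (taken from the arm names in the table above) are illustrative, not an existing API.

```python
from typing import Dict


def arm_ratios(arm_tokens: Dict[str, int]) -> Dict[str, float]:
    """Compute the headline ratios from per-arm token counts.

    Expected keys (hypothetical, named after the arms):
    full, toc_path, match_id, gql_all, gql_min.
    """
    full = arm_tokens["full"]
    gql_all = arm_tokens["gql_all"]
    if full <= 0 or gql_all <= 0:
        raise ValueError("baseline token counts must be positive")
    return {
        "ratio_section": arm_tokens["toc_path"] / full,  # section / full
        "ratio_window": arm_tokens["match_id"] / full,   # window / full
        "ratio_gql": arm_tokens["gql_min"] / gql_all,    # gql_min / gql_all
    }
```

With the documented claim of ~300 section tokens against ~3000 full-note tokens, ratio_section comes out at 0.1, which is the ~10× saving under test.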

Also log a correctness guard per arm (does the returned text actually contain the
answer token?) so a cheap arm that drops the answer isn't scored as a win.
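
One way to wire the guard in, as a sketch: check each arm's returned text for the answer token and refuse to score an arm that dropped it. The helper names and the "return None for a failed arm" convention are assumptions for illustration.

```python
from typing import Dict, Optional


def answer_present(arm_texts: Dict[str, str], answer_token: str) -> Dict[str, bool]:
    """Per arm: does the returned text actually contain the answer token?"""
    return {arm: answer_token in text for arm, text in arm_texts.items()}


def guarded_ratio(arm_tokens: int, full_tokens: int, present: bool) -> Optional[float]:
    """Return arm/full only when the arm kept the answer; otherwise None."""
    if not present or full_tokens <= 0:
        return None
    return arm_tokens / full_tokens
```

Reporting None rather than a small ratio keeps a cheap but wrong arm out of the medians instead of letting it look like the best result.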

Setup

  • Single trip2g instance. Vector ON via the mock embedding server
    (scripts/mock-embedding-server.mjs): the real chunk pipeline (internal/mdchunk,
    chunkTargetTokens=450) runs and produces precise chunk breadcrumbs, so toc_path
    resolves to the section — but the mock returns instant 1024-dim vectors, so no
    2.3 GiB model, no ~300s load. Token cost is independent of embedding quality; you
    only need chunks to exist. (With vector OFF, toc_path degrades to the note title —
    verified manually — so FTS-only is not a valid setup for this bench.)
  • Corpus: synthetic notes, sweep section count and section size (e.g. 5/10/20 sections
    × ~450-token sections). Each section embeds a unique answer token; queries map to a
    known section.
  • Fresh isolated SQLite, fast queue (GLOBAL_QUEUE_POLL_INTERVAL=100ms), wait for
    embedding jobs before measuring — same isolation pattern as scripts/bench-pushnotes.sh.
  • Not federated. Not the whoami mesh.
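
The synthetic corpus described above could be generated along these lines. This is a sketch under stated assumptions: section size is approximated by word count rather than real tokens, and the note title, heading layout, answer-token format (ANSWER-<note>-<n>) and toc_path shape are all hypothetical choices.

```python
from typing import Dict, List, Tuple

# Deterministic filler vocabulary so repeated runs produce identical notes.
FILLER = ("alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel")


def make_note(
    note_id: str, sections: int, words_per_section: int
) -> Tuple[str, List[Dict[str, object]]]:
    """Build one markdown note plus the (toc_path, answer_token) key per section."""
    note_title = f"Note {note_id}"
    lines = [f"# {note_title}", ""]
    answers: List[Dict[str, object]] = []
    for i in range(1, sections + 1):
        section_title = f"Section {i}"
        token = f"ANSWER-{note_id}-{i:03d}"  # unique per section
        filler = " ".join(FILLER[(i + k) % len(FILLER)] for k in range(words_per_section))
        lines += [f"## {section_title}", "", f"{filler} {token}.", ""]
        answers.append({"toc_path": [note_title, section_title], "answer_token": token})
    return "\n".join(lines), answers
```

Sweeping sections over 5/10/20 with a fixed words_per_section gives the grid from the corpus bullet; each returned answer entry doubles as a query with a known answer section.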

Harness + output (match the house style)

Mirror scripts/bench-pushnotes.{sh,mjs}:

  • scripts/bench-token-economy.sh drives scripts/bench-token-economy.mjs, env-driven
    (APP_URL, INTERNAL_URL, mock embedding), sweeps params, emits one JSON line per
    run to stdout → tee into .jsonl.
  • JSON fields: sections, section_tokens, full_tokens, window_tokens,
    gql_all_tokens, gql_min_tokens, ratio_section, ratio_window, ratio_gql,
    answer_present (per arm), tokenizer.
  • Publish like docs/dev/benchmark.md: .jsonl → wide-form .csv
    docs/{en,ru}/user/token_economy_bench.datachart.csv. Then update
    Token Economy.md / Fuzzy Pointer.md to cite the measured ratios instead of the
    current estimates.
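
The .jsonl → wide-form .csv step can be sketched with the standard library alone. The column order, and flattening the per-arm answer_present object into answer_present_<arm> columns, are assumptions; the actual layout of the published datachart CSV is not specified here.

```python
import csv
import io
import json

# Scalar fields from the JSON-fields bullet; order is an assumption.
COLUMNS = [
    "sections", "section_tokens", "full_tokens", "window_tokens",
    "gql_all_tokens", "gql_min_tokens",
    "ratio_section", "ratio_window", "ratio_gql", "tokenizer",
]


def jsonl_to_wide_csv(jsonl_text: str) -> str:
    """One CSV row per run, one column per metric (wide form)."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # Collect every arm that appears in any run's answer_present object.
    arms = sorted({arm for r in rows for arm in r.get("answer_present", {})})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS + [f"answer_present_{arm}" for arm in arms])
    for r in rows:
        present = r.get("answer_present", {})
        writer.writerow(
            [r.get(c, "") for c in COLUMNS] + [present.get(arm, "") for arm in arms]
        )
    return out.getvalue()
```

Missing fields become empty cells rather than errors, so a partial run still yields a row that the chart can skip.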

Out of scope / separate work

  • Retrieval quality (Recall@10 / nDCG / MRR): already covered by docs/dev/retrieval_eval.md
    and docs/en/thoughts/search-benchmark.md. This bench is cost, not quality.
  • Federation cost: federated_note_html(match_id) window vs full note across a peer
    — a follow-up; toc_path is local-only.
  • Position-recall ("Lost in the Middle", Token Economy.md:85): whether an answer
    at the top of context beats one buried in a full note. This is an LLM-recall study
    (needs a judge model + a held-out QA set), not a deterministic token count — track it
    as its own experiment, not part of this bench.

Validates

  • Token Economy.md:21 / Fuzzy Pointer.md:68 — section ~300 vs full ~3000 (~10×).
  • Token Economy.md:38-73 — GraphQL field-selection savings.

Update 2026-06-21 — what we actually built, and what it revealed

We did not build the synthetic mjs harness above. A simpler artifact answered the
question better: a zero-dependency Python script (scripts/token_economy_check.py)
that hits the LIVE public MCP endpoint (trip2g.com/_system/mcp) anonymously and
measures token economy on the real docs vault. Anyone can run it:
python3 scripts/token_economy_check.py.

Findings (live, real docs):

  • A focused read costs ~11× fewer tokens than the whole note (median across 9
    real queries). The "~300/3000 ≈ 10×" claim is directionally honest; the win scales
    with note size — long notes save most, short notes (already cheap) save little.
  • The token saving is the easy, nearly self-evident half. The hard, interesting
    half is selecting the right section: deterministic selection (regex, then
    snippet-vs-HTML substring) failed on 8 of 9 real queries. Picking the
    answer-bearing toc_path is a navigation problem, not a token-count problem.
  • Shareable chart: docs/en/user/token-economy-bench.md (+ ru) renders a live
    datachart from docs/en/user/token_economy_bench.datachart.csv.

Next: expand — progressive-disclosure tree navigation (new MCP tool)

The benchmark surfaced the real feature. Today an agent gets the whole flat toc
from search and jumps straight to a leaf. Progressive disclosure means walking the
tree node by node without loading content or searching every chunk:

toc(pid)                       -> top-level sections          (~20 tokens)
expand(pid, toc_path)          -> direct children of a node   (~15 tokens)
note_html(pid, toc_path)       -> read the chosen leaf        (~300 tokens)

expand(pid, toc_path) returns the direct children of a TOC node — title,
level, path, has_children — cheap labels to decide which branch to enter. An
empty/omitted toc_path returns the top level. Vector search becomes the entry
point; structural navigation moves around from there. The heading-breadcrumb-in-chunk
(Contextual Retrieval, already shipped as F4) is the bridge: a fuzzy query lands on a
chunk whose embedded breadcrumb is a precise toc_path.
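
The level-by-level walk can be sketched as a loop over an injected expand callable and a picker. Everything here is hypothetical scaffolding: walk_to_leaf, the in-memory TREE, fake_expand, and the keyword picker stand in for the real MCP call and for the agent/LLM that would normally choose a branch.

```python
from typing import Callable, Dict, List, Sequence

Node = Dict[str, object]
ExpandFn = Callable[[str, Sequence[str]], List[Node]]
PickFn = Callable[[List[Node]], Node]


def walk_to_leaf(expand: ExpandFn, pick: PickFn, pid: str, max_depth: int = 8) -> List[str]:
    """Descend one level per expand call until the chosen node has no children."""
    toc_path: List[str] = []
    for _ in range(max_depth):
        children = expand(pid, toc_path)
        if not children:
            break
        chosen = pick(children)
        toc_path = list(chosen["path"])
        if not chosen["has_children"]:
            break
    return toc_path


# In-memory stand-in for the server's TOC, keyed by parent path.
TREE: Dict[tuple, List[Node]] = {
    (): [
        {"title": "Setup", "level": 1, "path": ["Setup"], "has_children": True},
        {"title": "Usage", "level": 1, "path": ["Usage"], "has_children": False},
    ],
    ("Setup",): [
        {"title": "Install", "level": 2, "path": ["Setup", "Install"], "has_children": False},
        {"title": "Configure", "level": 2, "path": ["Setup", "Configure"], "has_children": False},
    ],
}


def fake_expand(pid: str, toc_path: Sequence[str]) -> List[Node]:
    return TREE.get(tuple(toc_path), [])


def keyword_picker(query: str) -> PickFn:
    """Naive picker: most title words shared with the query; ties go to the first child."""
    words = set(query.lower().split())

    def pick(children: List[Node]) -> Node:
        return max(children, key=lambda c: len(words & set(str(c["title"]).lower().split())))

    return pick
```

The returned toc_path is what note_html(pid, toc_path) would then read. The keyword picker is deliberately naive; the findings above are the reason to expect an agent or LLM in that slot.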

Two corollaries from review:

  • Drop the flat toc from search results. Once expand exists, structure is
    fetched on demand, so the per-result toc[] that search ships today is dead
    weight in every response — removing it slims search too. Nothing depends on it
    after expand lands.
  • Federated parity. Every navigation tool needs a federated_* twin
    (federated_expand) so a peer KB's tree navigates exactly like a local one,
    matching federated_search / federated_note_html.

External fit: a "detailed TOC of a 10-hour podcast" (knowlume distills long podcasts
and exposes exactly such a tree) is navigated with expand instead of dumping the
transcript — the same pattern, different corpus.

We start implementing expand (+ federated_expand, + search slimming) next.