Search refactoring — next steps (handoff)

This is a handoff for the next agent continuing the vector-search work. Read it together with docs/dev/search_refactoring.md (the running report with measured before/after numbers) and docs/dev/retrieval_eval.md (how the benchmark works).

What is already done (committed on branch feat/vector-search-benchmark)

  • Benchmark harness: internal/retrievaleval/ (metrics, golden-set loader, GraphQL search client, report), CLI cmd/evalretrieval/, isolated stack docker-compose.vecbench.yml + scripts/vecbench.sh, two golden sets (testdata/eval/golden_set.json = 60 short-corpus queries, testdata/eval/golden_set_longdocs.json = 16 long-doc queries).
  • F1 widen RRF fusion pool (vectorTopK 5→50, mcp DefaultVectorSearchLimit 10→50) — recall 0.983→1.0. Shipped.
  • F2 per-language bleve analyzer (en + ru, dual-field + disjunction query) — correctness fix, ~0 metric delta on this corpus. Shipped.
  • F3 cross-encoder reranker (internal/reranker/, reranker-server/) — measured worse (promotes near-neighbour distractors). Shipped OFF by default behind vector_search.reranker.enabled.
  • F4 chunk heading breadcrumb ({title} > {h1} > {h2}\n\n{body}) + token-aware sizing — long-doc nDCG@10 0.9308→0.9539 (en→en +0.09). Shipped.
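
For orientation on why F1 moved recall, here is a minimal sketch of Reciprocal Rank Fusion with a capped vector pool. It is not the project's implementation: fuseRRF, contains, and the sample IDs are hypothetical, and k=60 is taken from the RRF setting mentioned under the docs item. A note that only the vector side finds, at rank 7, never enters the pool when the cap is 5.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF merges a lexical and a vector ranking with Reciprocal Rank Fusion:
// score(id) = sum over lists of 1/(k + rank), with rank starting at 1.
// Only the first vectorTopK vector hits enter the pool.
func fuseRRF(lexical, vector []string, vectorTopK int, k float64) []string {
	if len(vector) > vectorTopK {
		vector = vector[:vectorTopK]
	}
	scores := map[string]float64{}
	for _, list := range [][]string{lexical, vector} {
		for i, id := range list {
			scores[id] += 1 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		sa, sb := scores[ids[a]], scores[ids[b]]
		if sa != sb {
			return sa > sb
		}
		return ids[a] < ids[b] // deterministic tie-break
	})
	return ids
}

// contains reports whether id is present in ids.
func contains(ids []string, id string) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

func main() {
	lexical := []string{"a", "b", "c"}
	vector := []string{"a", "d", "e", "f", "g", "h", "target"}
	fmt.Println(contains(fuseRRF(lexical, vector, 5, 60), "target"))  // false
	fmt.Println(contains(fuseRRF(lexical, vector, 50, 60), "target")) // true
}
```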

Baselines to beat: short corpus nDCG@10 0.9263, long-doc nDCG@10 0.9539 (both reranker-off, post-F4).

How to run the benchmark (important gotchas)

./scripts/vecbench.sh up          # build + fresh stack + push vault + embed + reload
./scripts/vecbench.sh rebuild     # after a Go CODE change (no re-embed)
./scripts/vecbench.sh sync        # after a VAULT change (re-push + embed + reload)
./scripts/vecbench.sh eval <label> <out.json>   # run golden_set.json (short) as admin
./scripts/vecbench.sh down

Endpoint: http://localhost:21081/_system/graphql (the /graphql alias is deprecated — always use /_system/graphql). Eval must run as admin (anonymous search returns only live notes; the vault is draft/latest) — vecbench.sh eval handles the dev sign-in (code 111111) and bearer token. To eval the long-doc set, run the CLI directly with -golden testdata/eval/golden_set_longdocs.json and a bearer token.

Critical gotchas (these cost hours):

  1. Chunker changes don't re-embed on sync. The embedding job dedups on the note content hash; changing chunker CODE doesn't change note content, so nothing re-embeds. To force it: wipe the DB (rm tmp/vecbench-data/vecbench.sqlite3*) and up.
  2. Stale .sync-state.json. obsidian-sync records what it already pushed; against a wiped DB it pushes nothing. cmd_up now rms testdata/vecbench/vault/.sync-state.json before pushing — keep that.
  3. In-memory chunk cache is stale until reload. Embeddings are generated by async jobs after sync; the app must reload to see them. vecbench.sh drains jobs (/debug/wait_all_jobs) then restarts the app — keep that flow.
  4. Concurrent edits: another session has uncommitted changes to cmd/server/main.go, internal/appconfig/config.go, and docs/marketing/*. Do NOT touch those files, and commit only your own files with explicit paths.

Remaining work items

1. F5 — AND→OR fallback + cosine norm precompute (effort S, quality ~0 / perf win)

  • Files: internal/noteloader/search.go (Search, the MatchQueryOperatorAnd at ~line 146), internal/case/sitesearch/resolve.go (vectorSearch ~120, cosineSimilarity ~335), internal/model/chunk.go (add Norm float32 to NoteChunk), internal/noteloader/loader.go (set norm at load).
  • Do: if the AND bleve query returns too few hits, re-run with MatchQueryOperatorOr and merge. Precompute each chunk's L2 norm at load. The embedding server is expected to return normalized vectors: verify that embedding-server/server.py calls sentence-transformers with normalize_embeddings=True, and if the vectors are already unit-norm, switch cosine to a plain dot product. Add a Warn log of len(chunks) + scan duration in vectorSearch.
  • Expect: ~0 quality delta on this corpus (recall already maxed), latency win. Measure on both golden sets, record in search_refactoring.md, commit perf(search): AND->OR fallback + precomputed norms (F5).
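
The two F5 changes can be sketched as follows. This is an illustration under assumptions, not the existing code: chunk stands in for NoteChunk with the proposed Norm field, searchFn is a hypothetical wrapper around the bleve call, and minHits is a placeholder for whatever "too few" threshold is chosen.

```go
package main

import (
	"fmt"
	"math"
)

// chunk is a hypothetical stand-in for NoteChunk with the proposed Norm field.
type chunk struct {
	Embedding []float32
	Norm      float32 // L2 norm, computed once at load
}

// dot assumes both vectors have the same dimension.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func l2Norm(v []float32) float32 {
	return float32(math.Sqrt(float64(dot(v, v))))
}

// newChunk computes the norm once, at load time.
func newChunk(emb []float32) chunk {
	return chunk{Embedding: emb, Norm: l2Norm(emb)}
}

// cosine reuses the precomputed norms, so the scan does one dot product per chunk.
// If every stored and query vector is unit-norm, dot(q, c.Embedding) alone suffices.
func cosine(q []float32, qNorm float32, c chunk) float32 {
	if qNorm == 0 || c.Norm == 0 {
		return 0
	}
	return dot(q, c.Embedding) / (qNorm * c.Norm)
}

// searchFn is a hypothetical lexical search: useOr=false means AND, true means OR.
type searchFn func(query string, useOr bool) []string

// searchWithFallback runs AND first; with fewer than minHits results it
// re-runs with OR and appends the new IDs after the AND hits.
func searchWithFallback(search searchFn, query string, minHits int) []string {
	hits := search(query, false)
	if len(hits) >= minHits {
		return hits
	}
	seen := map[string]bool{}
	merged := make([]string, 0, len(hits))
	for _, id := range hits {
		seen[id] = true
		merged = append(merged, id)
	}
	for _, id := range search(query, true) {
		if !seen[id] {
			seen[id] = true
			merged = append(merged, id)
		}
	}
	return merged
}

func main() {
	c := newChunk([]float32{3, 4})
	q := []float32{3, 4}
	fmt.Println(c.Norm, cosine(q, l2Norm(q), c))

	stub := func(_ string, useOr bool) []string {
		if useOr {
			return []string{"n1", "n2", "n3"}
		}
		return []string{"n1"}
	}
	fmt.Println(searchWithFallback(stub, "two words", 2))
}
```

Keeping the AND hits ahead of the OR-only hits is one possible merge order; whether that ordering matters depends on how the lexical ranks feed the fusion step, which should be checked against the benchmark.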

2. Option A — graphql_request returns MCP structuredContent (effort S)

The capability the user wants ("pass a GraphQL selection, get exactly that JSON") already exists as the graphql_request MCP tool. gqlgen is the projector (returns only selected fields) via env.GraphQLRequest (cmd/server/main.go:3013). The only gap is the return shape.

  • File: internal/case/mcp/graphql_tools.go, handleGraphQLRequest (lines 53–68).
  • Do: after result, err := env.GraphQLRequest(...), json.Unmarshal(result, &parsed) and return structuredToolResult(stub, parsed) (helper at resolve.go:90). Pass the full {data,errors} envelope as structuredContent so GraphQL errors aren't hidden.
  • ⚠ TOKEN DUPLICATION (the user flagged this): structuredToolResult currently sets BOTH Content text AND StructuredContent to the payload, so the model sees the JSON twice (~2× tokens). Do NOT pass the full JSON as the text arg. Instead pass a short stub as text (e.g. "structured result" or an empty string) and the parsed JSON as structuredContent, so the payload appears once. The whole point (fuzzy pointer + projection) is token economy; don't undo it. Verify which of the two fields the trip2g MCP client actually feeds to the model and pick the single representation accordingly.
  • Test: internal/case/mcp/graphql_tools_test.go — assert StructuredContent is set and the text block is the stub, not a second copy.
  • Commit: feat(mcp): graphql_request returns structuredContent (single-copy).
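
A minimal sketch of the single-copy shape described above. The toolResult type and graphqlToolResult function are hypothetical stand-ins; the real code would go through structuredToolResult and the MCP SDK's result type, whose exact fields are not shown in this handoff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolResult is a hypothetical stand-in for the MCP tool result type.
type toolResult struct {
	Text              string // the content text block
	StructuredContent any    // the parsed payload, carried once
}

// structuredStub is the short text sent instead of a second JSON copy.
const structuredStub = "structured result"

// graphqlToolResult decodes the raw GraphQL response and returns the full
// {data, errors} envelope as structured content, with only a stub as text.
func graphqlToolResult(raw []byte) (toolResult, error) {
	var parsed map[string]any
	if err := json.Unmarshal(raw, &parsed); err != nil {
		return toolResult{}, fmt.Errorf("decode graphql response: %w", err)
	}
	return toolResult{Text: structuredStub, StructuredContent: parsed}, nil
}

func main() {
	raw := []byte(`{"data":{"note":{"title":"T"}},"errors":[{"message":"boom"}]}`)
	res, err := graphqlToolResult(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res.Text)
	fmt.Println(res.StructuredContent)
}
```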

3. Query-only + root-field whitelist guard on graphql_request (effort S, the real deliverable)

env.GraphQLRequest forges an admin token (appreq.WithAdminToken, cmd/server/main.go:3015), so an arbitrary GraphQL string can run mutations (createAdmin, banUser, delete notes) and read admin-only data. It's gated behind the per-key enableMcpAdminTools flag today, but before any agent loop uses it, add guards.

  • File: internal/case/mcp/graphql_tools.go (validate before calling env.GraphQLRequest). Do the validation in the mcp package (do NOT modify main.go — concurrent edits).
  • Do: parse the query with github.com/vektah/gqlparser/v2/parser ParseQuery (no schema needed). Reject if any OperationDefinition.Operation != Query (block Mutation/Subscription, incl. multi-operation smuggling). Whitelist top-level selection field names to a read set: note, search, similarNotes, viewer, notePaths, resolveWikilinks (confirm exact root field names in internal/graph/schema.graphqls); reject admin and anything else. Log every call (operation string + key owner) for audit.
  • Test: reject a mutation, reject { admin { ... } }, allow { note(...) { title } }.
  • Commit: feat(mcp): query-only + root-field allowlist for graphql_request.
  • Stretch (separate, larger — do NOT bundle): run the query in the caller's context instead of admin so CanReadNote/subgraph gates apply. This is a broad change (the whole MCP read model equates API key with admin, resolve.go:485); leave as a follow-up.
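
The guard logic can be sketched independently of the parser. In the sketch below, operation is a hypothetical stand-in for what would be read off the parsed document (the operation kind and the top-level selections); the allowlist copies the unconfirmed field names from the item above. Rejecting root-level fragment spreads and inline fragments is an extra assumption of this sketch, made because a root fragment could otherwise carry a non-allowlisted field past a check that only looks at plain field selections.

```go
package main

import (
	"errors"
	"fmt"
)

// operation is a hypothetical summary of one parsed GraphQL operation.
type operation struct {
	Kind            string   // "query", "mutation" or "subscription"
	RootFields      []string // names of top-level field selections
	HasRootFragment bool     // fragment spread or inline fragment at root level
}

// allowedRoots is the read set; names still need confirming against the schema.
var allowedRoots = map[string]bool{
	"note":             true,
	"search":           true,
	"similarNotes":     true,
	"viewer":           true,
	"notePaths":        true,
	"resolveWikilinks": true,
}

// validateReadOnly rejects the whole document if ANY operation in it is not
// a query or selects a root field outside the allowlist.
func validateReadOnly(ops []operation) error {
	if len(ops) == 0 {
		return errors.New("no operation in document")
	}
	for _, op := range ops {
		if op.Kind != "query" {
			return fmt.Errorf("operation type %q is not allowed", op.Kind)
		}
		if op.HasRootFragment {
			return errors.New("root-level fragments are not allowed")
		}
		for _, name := range op.RootFields {
			if !allowedRoots[name] {
				return fmt.Errorf("root field %q is not allowed", name)
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(validateReadOnly([]operation{{Kind: "query", RootFields: []string{"note"}}}))
	fmt.Println(validateReadOnly([]operation{{Kind: "query", RootFields: []string{"admin"}}}))
	fmt.Println(validateReadOnly([]operation{{Kind: "mutation", RootFields: []string{"banUser"}}}))
}
```

Because the check is an allowlist, introspection roots such as __schema are rejected as well unless they are added deliberately.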

4. Harden tocPathForSnippet — make the fuzzy pointer reliable (effort S–M)

This is the linchpin of the "fuzzy → exact section" chain. It currently substring-matches the search snippet against rendered HTML sections and can silently return nil, dropping the breadcrumb.

  • File: internal/case/mcp/toc_path.go (tocPathForSnippet ~57, findDeepestSection/matchHeaderDiv ~79).
  • Do:
    • (a) Normalize both sides with ONE shared function before the strings.Contains match. This is the core fix — today the two sides drift:
      • The asymmetry to remove: the snippet side (toc_path.go:62) extracts text with htmlPlainText(markedContext(snippet)), but the section side (matchHeaderDiv, toc_path.go:89) uses htmlNodeText(n). Two different extractors over two different HTML pipelines (bleve highlight fragment vs goldmark-rendered note HTML) → the same words can serialize differently, so Contains fails.
      • Write one normalizeForMatch(s string) string and call it on BOTH the snippet target and each section's text. It must, in this order:
        1. extract plain text the same way for both (pick one path — e.g. parse with golang.org/x/net/html and concat text nodes — don't use htmlPlainText on one side and htmlNodeText on the other);
        2. html.UnescapeString on both sides to decode entities (&amp;→&, &nbsp;→U+00A0, &#39;→'), because the bleve fragment and goldmark HTML escape differently;
        3. replace non-breaking / zero-width / soft-hyphen runes (U+00A0, U+200B, U+00AD, U+FEFF) with a normal space (or drop the zero-width ones) — strings.Fields treats U+00A0 as space but not U+200B/U+00AD;
        4. Unicode-normalize to NFC (golang.org/x/text/unicode/norm) so composed vs decomposed Cyrillic/Latin diacritics match;
        5. strings.ToLower;
        6. collapse whitespace via strings.Join(strings.Fields(s), " ").
      • Optionally also fold typographic punctuation (smart quotes “ ” → ", ’ → ', en/em dash → -) since goldmark may smart-quote while bleve keeps ASCII. Keep this conservative: only quotes/dashes, don't strip all punctuation (that over-matches).
      • Replace line 62 with target := normalizeForMatch(htmlOf(markedContext(snippet))) and line 89 with sectionText := normalizeForMatch(innerHTMLOf(n)), both feeding the same extractor+normalizer. Add a unit test that the same logical text rendered by the two pipelines normalizes to byte-identical strings.
    • (b) Add a robust fallback using F4: the chunk content now starts with the breadcrumb {title} > {section} > {subsection}. The match_id (p{pid}:c{chunk}) identifies the chunk; read that chunk's breadcrumb and derive the toc_path directly — no fuzzy text match needed. This always yields a path.
    • (c) Never return nil — fall back to the note root / first heading.
    • Invariant: sectionHTMLByTocPath/navigateSectionPath match data-header by exact title string; the emitted path must use the same heading strings as the data-header attribute. Don't introduce normalization drift between emit and lookup.
  • Test: a snippet whose HTML reflows differently still resolves; an intro-before-first-heading match still gets a path; a chunk-index fallback path round-trips through sectionHTMLByTocPath.
  • Commit: fix(mcp): robust toc_path from snippet (normalize + chunk-breadcrumb fallback).
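
Parts (a) and (b) can be sketched with the standard library only. This is a partial illustration under stated assumptions: normalizeForMatch here takes already-extracted plain text, so step 1 (the shared HTML text extractor) and step 4 (NFC via golang.org/x/text/unicode/norm) are left out and must be added in the real function. tocPathFromChunk is a hypothetical name; it assumes the toc_path excludes the note title, that an empty path means the note root, and that headings never contain " > ".

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// foldRunes maps NBSP to a space, drops zero-width and soft-hyphen runes,
// and folds smart quotes and en/em dashes to ASCII.
var foldRunes = strings.NewReplacer(
	"\u00a0", " ",
	"\u200b", "", "\u00ad", "", "\ufeff", "",
	"\u201c", `"`, "\u201d", `"`,
	"\u2018", "'", "\u2019", "'",
	"\u2013", "-", "\u2014", "-",
)

// normalizeForMatch applies steps 2, 3, 5 and 6 plus the optional
// punctuation fold. Call it on BOTH the snippet text and the section text.
func normalizeForMatch(s string) string {
	s = html.UnescapeString(s)
	s = foldRunes.Replace(s)
	s = strings.ToLower(s)
	return strings.Join(strings.Fields(s), " ")
}

// tocPathFromChunk reads the breadcrumb on the first line of a chunk
// ("{title} > {h1} > {h2}") and returns the heading path without the title.
// Heading strings are returned untouched so they still match data-header.
// It never returns nil: a title-only breadcrumb yields an empty path.
func tocPathFromChunk(content string) []string {
	first, _, _ := strings.Cut(content, "\n")
	parts := strings.Split(first, " > ")
	if len(parts) < 2 {
		return []string{}
	}
	return parts[1:]
}

func main() {
	snippet := "Tom&nbsp;&amp; Jerry&#39;s  \u201cPlan\u201d"
	section := "tom & Jerry\u2019s \"plan\""
	fmt.Println(normalizeForMatch(snippet) == normalizeForMatch(section))
	fmt.Println(tocPathFromChunk("Guide > Install > Linux\n\nRun the installer."))
}
```

Only the matching side is normalized. The emitted path keeps the original heading strings, in line with the invariant above.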

5. MCP tool descriptions — explain the breadcrumb + drill-down (effort S) — task #7

  • File: internal/case/mcp/resolve.go (search description at line ~178, federated_search description, note_html description at ~203).
  • Do: document the complementary drill-down so the consuming LLM uses it:
    1. search returns a chunk snippet with a heading breadcrumb {title} > {section} > {subsection} — locates the approximate section;
    2. the search result's TOC (path arrays) gives the note's precise structure;
    3. note_html(toc_path=[…]) reads the exact section.
      Frame breadcrumb → TOC → toc_path as fuzzy → structure → precise.
  • Commit: docs(mcp): describe breadcrumb + toc_path drill-down in tool descriptions.

6. Docs (delegate to the trip2g-docs agent — EN + RU passes)

  • docs/dev/vector_search.md (task #2) — it's stale (lists hybrid search as "future"). Rewrite to current reality: hybrid BM25+vector RRF k=60, per-chunk embeddings, brute-force cosine, bge-m3 default, per-language analyzer, optional reranker (off), and the F4 chunk format {title} > {h1} > {h2}\n\n{body} + token sizing. Note the breadcrumb is non-obvious and appears in snippets. Verify against code.
  • docs/{en,ru}/thoughts/… (task #6): article on the bugs found + the benchmark + the fixes. Results first (the deltas table from search_refactoring.md), details below. Follow docs/CLAUDE.md writing rules (the point first, no officialese).
  • docs/{en,ru}/user/Fuzzy Pointer.md (task #8) — user-facing concept note: vector search returns a deliberately imprecise pointer (breadcrumb) that resolves via the TOC to an exact section (note_html toc_path). Three-step drill-down; how AI agents use it over a published vault.

Decided AGAINST (do not build)

  • federated_graphql_request — arbitrary GraphQL across federation trust boundaries. Federation is deliberately narrow (scoped by inbound/outbound secrets + subgraph grants); letting peer A run arbitrary GraphQL on peer B (as admin) is a different, much worse risk class, plus schema-drift across peer versions and N-shape result merging. If structured cross-pool data is ever needed, add specific curated federated read fields, not arbitrary GraphQL.

Suggested order

F5 (finish the fix track) → Option A + query-only guard (small, high value, satisfies the GraphQL-projection ask safely) → harden tocPathForSnippet → MCP descriptions → docs last (so they describe the final state).