Retrieval evaluation (vector-search benchmark)
A reproducible benchmark for trip2g's hybrid search (BM25 + vector, fused with RRF). It measures Recall@k, nDCG@k, and MRR over a versioned golden set of bilingual RU/EN queries, so every change to the search pipeline can be A/B-tested with numbers instead of guesses.
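For readers unfamiliar with RRF (Reciprocal Rank Fusion), here is a minimal sketch of how two ranked lists can be fused. It is an illustration and not trip2g's actual code: the function name `rrfFuse`, the example URLs, and the constant `k = 60` (a commonly used default) are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse merges ranked lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1 / (k + rank), where rank starts at 1.
// Hypothetical helper; it assumes each list holds unique IDs.
func rrfFuse(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Highest fused score first; ties are broken by ID so output is deterministic.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	bm25 := []string{"/a", "/b", "/c"}   // made-up BM25 ranking
	vector := []string{"/b", "/d", "/a"} // made-up vector ranking
	fmt.Println(rrfFuse(60, bm25, vector))
}
```

Documents that rank well in both lanes (`/a` and `/b` here) rise above documents that appear in only one, which is why a regression in either lane can shift the fused ranking.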
What's where
| Path | Role |
|---|---|
| `docker-compose.vecbench.yml` | Isolated stack: one app + MinIO + a bge-m3 embedding server, on its own ports (21081/21082, MinIO 22000/22001, embedding 11436). Independent of the e2e stack and dev. |
| `scripts/vecbench.sh` | Driver: up / sync / eval / down / logs. |
| `testdata/vecbench/vault/` | The corpus: 6 topics × {EN, RU} = 12 notes, so all four retrieval directions are testable. See its `README.md`. |
| `testdata/eval/golden_set.json` | Versioned query → relevant-URL pairs (qrels). |
| `cmd/evalretrieval/` | CLI: runs the golden set against a live GraphQL endpoint, prints metrics, writes a JSON artifact, gates on a min nDCG. |
| `internal/retrievaleval/` | Pure-Go metrics + golden-set loader + GraphQL search client + report aggregation. |
| `docs/superpowers/eval-runs/*.json` | Committed run artifacts: the data spine for the before/after comparison. |
Prerequisites
- Docker (the stack builds the app image and runs a local bge-m3 server).
- The bge-m3 model (~2.3 GB) downloads on first run into the shared `~/models/embedding` cache, reused across the e2e and vecbench stacks.
- Node (for `obsidian-sync`, used to push the vault).
Run it
    # 1. Bring up the isolated stack, push the vault, generate embeddings, reload.
    ./scripts/vecbench.sh up        # or: make vecbench-up

    # 2. Run the benchmark; writes docs/superpowers/eval-runs/<label>.json
    ./scripts/vecbench.sh eval baseline-bgem3

    # 3. Tear down when done (keeps the model cache).
    ./scripts/vecbench.sh down
After editing the vault, re-ingest with ./scripts/vecbench.sh sync (re-pushes, drains embedding jobs, reloads the app).
How it works (and two gotchas)
- The embeddings endpoint is required. The vector lane embeds the query live on every request, so the benchmark needs a reachable bge-m3 server. The compose file provides one.
- Embeddings are generated after the vault sync, by async jobs. The in-memory chunk cache (and the bleve index) load at boot, so right after a fresh sync the vector lane is empty until the app reloads. `vecbench.sh up`/`sync` handle this by waiting for the job queue to drain (`/debug/wait_all_jobs`) and then restarting the app. (This stale-until-reload behaviour is a known limitation of the current loader; see `docs/dev/vector_search.md`.)
- The benchmark searches as admin. Anonymous search returns only live notes; the synced vault is draft/latest. `vecbench.sh eval` signs in with the dev code and passes the session JWT as a bearer token so the CLI sees the full (latest) index.
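The admin-search point can be illustrated with a minimal sketch of a GraphQL-over-HTTP request that carries the session JWT as a bearer token. The helper `newSearchRequest`, the `/graphql` path, and the query document (field and argument names) are assumptions for illustration; the real schema and client live in `internal/retrievaleval/`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newSearchRequest builds a POST request with a JSON GraphQL payload.
// The query text below is a hypothetical schema, not trip2g's actual one.
func newSearchRequest(endpoint, token, text string) (*http.Request, error) {
	body, err := json.Marshal(map[string]interface{}{
		"query":     "query Search($q: String!) { search(query: $q) { url } }",
		"variables": map[string]string{"q": text},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Without a token the request is anonymous and would only see live notes.
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// Port 21081 is one of the vecbench ports; the path is an assumption.
	req, err := newSearchRequest("http://localhost:21081/graphql", "session-jwt", "vector search")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
}
```

The sketch only builds the request; sending it with `http.DefaultClient.Do(req)` requires the stack to be up.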
Building the golden set
Public benchmarks (MIRACL, ruMTEB, BEIR) measure embedding-model quality, but no public set matches this corpus. The golden set is small and project-specific:
- Generate candidate queries with an LLM (3 per note: synonyms, no copied phrases, plus one "need-based" phrasing), in both languages, including cross-lingual (RU query → EN note and vice versa — the topics are parallel).
- Hand-verify every pair: run it through the live endpoint, confirm the intended note is genuinely correct, and set `expected_urls` to the actual permalink(s) the endpoint returns. Drop ambiguous pairs (expect ~20% loss). Only `verified: true` pairs count.
- Keep the four `direction` buckets (`ru->ru`, `en->en`, `ru->en`, `en->ru`) roughly balanced so cross-lingual regressions surface.
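The verification and balance rules above can be checked mechanically. The sketch below counts verified pairs per direction. It assumes the golden set is a top-level JSON array and that the query field is named `query`; only `expected_urls`, `verified`, and `direction` come from this document, and the example URLs are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// goldenPair mirrors one golden-set entry. The file layout is an assumption.
type goldenPair struct {
	Query        string   `json:"query"` // assumed field name
	Direction    string   `json:"direction"`
	ExpectedURLs []string `json:"expected_urls"`
	Verified     bool     `json:"verified"`
}

// verifiedByDirection counts pairs that are verified and have at least one
// expected URL, grouped by direction bucket.
func verifiedByDirection(data []byte) (map[string]int, error) {
	var pairs []goldenPair
	if err := json.Unmarshal(data, &pairs); err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, p := range pairs {
		if p.Verified && len(p.ExpectedURLs) > 0 {
			counts[p.Direction]++
		}
	}
	return counts, nil
}

func main() {
	sample := []byte(`[
	  {"query": "q1", "direction": "ru->ru", "expected_urls": ["/ru/a"], "verified": true},
	  {"query": "q2", "direction": "en->ru", "expected_urls": ["/ru/a"], "verified": true},
	  {"query": "q3", "direction": "en->ru", "expected_urls": ["/ru/b"], "verified": false}
	]`)
	counts, err := verifiedByDirection(sample)
	if err != nil {
		panic(err)
	}
	for _, d := range []string{"ru->ru", "en->en", "ru->en", "en->ru"} {
		fmt.Printf("%s: %d\n", d, counts[d])
	}
}
```

A bucket that prints 0, or one that is much smaller than the others, is a sign that the set needs more pairs in that direction.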
For picking/justifying the embedding model itself, the relevant public benchmarks are MIRACL-ru, ruMTEB retrieval, RusBEIR, NeuCLIRBench (EN→RU cross-lingual), and small BEIR sets (SciFact/NFCorpus) as an English CI guard.
Metrics
| Metric | Meaning |
|---|---|
| Recall@k | fraction of relevant notes found in the top k |
| nDCG@k | rank-discounted relevance (rewards putting the right note higher) |
| MRR | mean of 1/rank of the first relevant result |
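
As a rough illustration of the three metrics for a single query, the sketch below uses binary relevance and assumes the ranked list has no duplicate URLs. The function names are hypothetical and may differ from those in `internal/retrievaleval/`; MRR is the mean of `reciprocalRank` over all queries.

```go
package main

import (
	"fmt"
	"math"
)

// recallAtK is the fraction of relevant URLs that appear in the top k results.
func recallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	hits := 0
	for i, u := range ranked {
		if i >= k {
			break
		}
		if relevant[u] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// ndcgAtK divides the discounted gain of the top k by the gain of an ideal
// ranking that places every relevant URL first (binary gains).
func ndcgAtK(ranked []string, relevant map[string]bool, k int) float64 {
	dcg := 0.0
	for i, u := range ranked {
		if i >= k {
			break
		}
		if relevant[u] {
			dcg += 1 / math.Log2(float64(i+2))
		}
	}
	ideal := len(relevant)
	if ideal > k {
		ideal = k
	}
	idcg := 0.0
	for i := 0; i < ideal; i++ {
		idcg += 1 / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

// reciprocalRank is 1/rank of the first relevant result, or 0 if none is found.
func reciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, u := range ranked {
		if relevant[u] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	ranked := []string{"/x", "/en/a", "/y", "/ru/a"} // made-up search result
	relevant := map[string]bool{"/en/a": true, "/ru/a": true}
	fmt.Printf("Recall@3=%.3f nDCG@3=%.3f RR=%.3f\n",
		recallAtK(ranked, relevant, 3), ndcgAtK(ranked, relevant, 3), reciprocalRank(ranked, relevant))
}
```

With this toy input the program prints `Recall@3=0.500 nDCG@3=0.387 RR=0.500`: one of the two relevant notes is in the top 3, and it sits at rank 2.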
The CLI reports each overall and per direction. A run with -fail-under-ndcg X (or EVAL_MIN_NDCG=X) exits nonzero when overall nDCG@k drops below X — use this as a CI regression gate once a baseline is established.
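The gate logic can be sketched as follows. The precedence shown here (an explicit flag value wins over `EVAL_MIN_NDCG`, and a threshold of 0 disables the gate) is an assumption, as are the helper names; check `cmd/evalretrieval/` for the actual behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// minNDCG resolves the threshold. Assumed precedence: flag value first,
// then the environment string, otherwise 0 (gate disabled).
func minNDCG(flagValue float64, envValue string) (float64, error) {
	if flagValue > 0 {
		return flagValue, nil
	}
	if envValue == "" {
		return 0, nil
	}
	return strconv.ParseFloat(envValue, 64)
}

// passesGate reports whether the overall nDCG meets the threshold.
func passesGate(overallNDCG, threshold float64) bool {
	return threshold <= 0 || overallNDCG >= threshold
}

func main() {
	threshold, err := minNDCG(0, os.Getenv("EVAL_MIN_NDCG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid EVAL_MIN_NDCG:", err)
		os.Exit(2)
	}
	overall := 0.9157 // baseline nDCG@10 from the run history in this document
	if !passesGate(overall, threshold) {
		fmt.Fprintf(os.Stderr, "nDCG %.4f is below the minimum %.4f\n", overall, threshold)
		os.Exit(1) // nonzero exit fails the CI job
	}
	fmt.Println("gate passed")
}
```

Setting the threshold slightly below the established baseline leaves room for run-to-run noise while still catching real regressions.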
Run history
Corpus: 48 notes (24 topics × 2 languages; the 24 topics are 6 themes, each with a core note plus 3 intra-theme distractors). Golden set: 60 hand-verified queries, balanced across the four directions. A result counts as relevant if it is either language version of the query's topic.
| Run | Recall@10 | nDCG@10 | MRR | Notes |
|---|---|---|---|---|
| 00-baseline-bgem3 | 0.9833 | 0.9157 | 0.9417 | bge-m3, current pipeline. Weakest direction: en→ru (nDCG 0.845, MRR 0.833). |
Per-direction at baseline: ru→ru nDCG 0.935 / MRR 0.972 · en→en 0.932 / 0.972 · ru→en 0.934 / 0.958 · en→ru 0.845 / 0.833.