
Vector Search Research

Summary of Findings

Research on best practices for semantic search with text-embedding-3-small over an Obsidian Markdown vault.


1. Chunking

Should you chunk? Yes, always. Embedding an entire note (2000–10000 tokens) degrades retrieval quality because the embedding averages over too many topics. Even short notes benefit from consistent chunking.

Recommended chunk size for Markdown: 400–512 tokens with 50–100 tokens of overlap (10–20%). Use a semantic/recursive splitter over the separators ["\n\n", "\n", ". ", " "], which respects heading and paragraph boundaries. Fixed-size splitting performs worse.
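A minimal Go sketch of such a recursive splitter: it tries each separator in priority order and recurses on pieces that are still too long. It is character-based and omits overlap for brevity; a real indexer would count tokens and add the 50–100-token overlap.

```go
package main

import "strings"

// Separators in priority order: paragraph break, line break, sentence, word.
var separators = []string{"\n\n", "\n", ". ", " "}

// splitRecursive breaks text into pieces of at most maxLen runes, preferring
// the coarsest separator that is present, then recursing on oversized pieces.
func splitRecursive(text string, maxLen int) []string {
	if len([]rune(text)) <= maxLen {
		return []string{text}
	}
	for _, sep := range separators {
		parts := strings.Split(text, sep)
		if len(parts) == 1 {
			continue // this separator does not occur; try a finer one
		}
		var chunks []string
		cur := ""
		for _, p := range parts {
			candidate := p
			if cur != "" {
				candidate = cur + sep + p
			}
			if len([]rune(candidate)) <= maxLen {
				cur = candidate
				continue
			}
			if cur != "" {
				chunks = append(chunks, cur)
			}
			cur = p
		}
		if cur != "" {
			chunks = append(chunks, cur)
		}
		// A single part can still exceed maxLen; split it with finer separators.
		var out []string
		for _, c := range chunks {
			out = append(out, splitRecursive(c, maxLen)...)
		}
		return out
	}
	return []string{text} // no separator left to split on
}
```

Because paragraph breaks are tried before sentence and word breaks, heading and paragraph boundaries win whenever they fit the budget.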

Parent Document Retriever pattern (dominant in production RAG):

  • Embed and index only small child chunks (200–512 tokens) for vector search
  • Store full note keyed by doc_id metadata on every child chunk
  • On query: find top-K matching child chunks → fetch full parent note per doc_id
  • Gives precise retrieval (small chunks match better) + full context for display
  • Natural parent boundary for Obsidian: the full note file
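The fetch step of the pattern can be sketched in Go as follows (in-memory maps stand in for the SQLite tables; `Child`, `retrieveParents`, and the field names are illustrative, not from either reference repo):

```go
package main

import "sort"

// Child is a small chunk indexed for vector search; DocID keys its parent note.
type Child struct {
	DocID string
	Text  string
	Score float64 // similarity to the query, filled in by the vector store
}

// retrieveParents implements the fetch step: top-K child chunks in,
// deduplicated full parent notes out, in order of best child score.
func retrieveParents(children []Child, parents map[string]string, k int) []string {
	sort.SliceStable(children, func(i, j int) bool {
		return children[i].Score > children[j].Score
	})
	if k > len(children) {
		k = len(children)
	}
	seen := map[string]bool{}
	var out []string
	for _, c := range children[:k] {
		if seen[c.DocID] {
			continue // several chunks of the same note: return the note once
		}
		seen[c.DocID] = true
		out = append(out, parents[c.DocID])
	}
	return out
}
```

Deduplication by doc_id is what makes the pattern work: multiple matching chunks from one note collapse into a single full-note result.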

2. Short Query vs Long Document Problem

Single-word query ("space") vs long document gives low cosine similarity (~0.2–0.3) even for relevant pairs. This is expected — the query is a sparse point, the document is a centroid over many concepts.

Fixes by implementation cost:

  • Prepend title to chunk: add note title + H1 to each chunk before embedding (cost: zero; gain: low–medium)
  • Query expansion (LLM): rewrite the short query into 2–3 sentences before embedding (cost: +1 LLM call; gain: high)
  • HyDE: generate a hypothetical answer document and embed that (cost: +1 LLM call; gain: high)
  • Multi-query: 3–5 query variants, union of results (cost: +1 LLM call; gain: medium)
  • BM25 hybrid: FTS + vector + RRF (cost: already implemented; gain: high for keyword queries)

Easiest quick win: prepend the note title and H1 heading to every chunk at index time. Zero cost.
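A sketch of the index-time prepend (the function name and newline layout are illustrative choices, not from any of the reference repos):

```go
package main

import "strings"

// withTitleContext builds the text actually sent to the embedding model:
// note title (and H1, when it differs) prepended so that a short query like
// "space" can match chunks whose body never repeats the topic word.
func withTitleContext(title, h1, chunk string) string {
	parts := []string{title}
	if h1 != "" && h1 != title {
		parts = append(parts, h1)
	}
	parts = append(parts, chunk)
	return strings.Join(parts, "\n")
}
```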

HyDE for short queries: when query length < 5 words, use GPT-4o-mini to generate a hypothetical passage that would answer the question, then embed that passage. Document-to-document similarity is much higher than query-to-document.


3. HyDE (Hypothetical Document Embeddings)

  1. Take the user query
  2. Prompt LLM: "Write a short passage that would answer: [query]"
  3. Embed the generated hypothetical document
  4. Search corpus with that embedding
  5. Discard hypothetical document — never shown to user

Why it helps: the hypothetical doc lives in the same dense region as real documents. Particularly strong for multilingual retrieval. Outperforms BM25 and unsupervised dense retrieval on benchmark tasks.

When it hurts: LLM hallucination in domain-specific/private vaults can drift the embedding away from the real corpus.

Verdict: implement as optional path for short queries (< 5 words).
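The optional-path wiring can be sketched like this, with the LLM call hidden behind a hypothetical interface (the `LLM` type and `fixedLLM` stub are assumptions for illustration; a real implementation would call GPT-4o-mini):

```go
package main

import "strings"

// LLM abstracts the completion call (e.g., GPT-4o-mini). Hypothetical interface.
type LLM interface {
	Complete(prompt string) (string, error)
}

// embeddingText decides what gets embedded: queries shorter than 5 words go
// through HyDE; longer queries are embedded as-is. The hypothetical passage
// is used only for retrieval and never shown to the user.
func embeddingText(query string, llm LLM) string {
	if len(strings.Fields(query)) >= 5 {
		return query
	}
	hypo, err := llm.Complete("Write a short passage that would answer: " + query)
	if err != nil {
		return query // fail open: fall back to embedding the raw query
	}
	return hypo
}

// fixedLLM is a stand-in for tests; it always returns the same passage.
type fixedLLM struct{ reply string }

func (f fixedLLM) Complete(prompt string) (string, error) { return f.reply, nil }
```

Failing open on LLM errors keeps search available even when the extra call is down; the query just loses the HyDE boost.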


4. Similarity Threshold Calibration

text-embedding-3-small returns lower similarity scores than ada-002 (roughly 5–10% lower). Community rule of thumb: start at 0.25–0.30, not 0.70+.

Fixed threshold is a bad idea as sole gate. Two failure modes:

  • Too high → zero results for valid queries
  • Too low → noise floods results

Practical approach:

  1. Start with top-K (K=5–10) without threshold during development
  2. Sample 50–100 relevant + 50–100 irrelevant query/note pairs from your vault
  3. Plot score distribution, set threshold at the natural gap
  4. Hybrid gate in production: score >= threshold OR rank <= 3 — always return at least 1 result
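The hybrid gate from step 4 is a few lines of Go (type and function names are illustrative):

```go
package main

// Hit is one scored search result; hits are assumed sorted by score descending.
type Hit struct {
	Note  string
	Score float64
}

// gate applies the hybrid rule: keep a hit if its score clears the threshold
// OR it ranks in the top 3. Because the rank floor always admits the leading
// hits, a valid query never comes back empty, while low-scoring tails beyond
// rank 3 are still cut off.
func gate(hits []Hit, threshold float64) []Hit {
	var out []Hit
	for rank, h := range hits {
		if h.Score >= threshold || rank < 3 {
			out = append(out, h)
		}
	}
	return out
}
```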

Expected similarity values for text-embedding-3-small (Russian):

  • Highly relevant (same topic, similar phrasing): 0.55–0.75
  • Relevant (same topic, different phrasing): 0.35–0.55
  • Tangentially related: 0.20–0.35
  • Unrelated: 0.10–0.25
  • Short 1–3 word query vs full document: 0.15–0.30 even for relevant pairs

Russian scores ~5–10% lower than English equivalents (training data is English-dominant).


5. Asymmetric Search — Better Models for Russian

text-embedding-3-small is symmetric (single encoder for query and document). Dedicated asymmetric models are better for retrieval.

Best open-source alternatives for Russian (ruMTEB benchmark):

  • multilingual-e5-large-instruct: best on ruMTEB; use the "query: " prefix for queries and "passage: " for documents
  • BGE-M3: best on Russian retrieval specifically; supports dense + sparse; 100+ languages
  • ru-en-RoSBERTa: bilingual RU+EN; competitive at lower cost

Instruction-tuned models (prompted with, e.g., "Instruct: Retrieve passages that answer the question") directly address the asymmetric query/document problem without needing HyDE.

Verdict: text-embedding-3-small is pragmatic if already using the OpenAI API. Upgrade to BGE-M3 or multilingual-e5-large-instruct only if measured recall is unsatisfactory.


6. Implementation Roadmap (Ordered by ROI)

  1. Top-K without threshold — remove fixed threshold during development, inspect real distributions
  2. Prepend note title to every chunk — zero cost, improves short query recall
  3. Chunking (400–512 tokens, Markdown boundaries) + Parent Document Retriever pattern
  4. HyDE for short queries (< 5 words) — one GPT-4o-mini call
  5. Threshold from data — calibrate from vault-specific query/note pairs
  6. BGE-M3 or multilingual-e5-large-instruct if Russian quality is insufficient


7. Reference Implementations

tobi/qmd (TypeScript, local-only)

Offline semantic search over Markdown via MCP. Dual-model hybrid:

  • SQLite FTS5 (BM25) + sqlite-vec cosine similarity
  • Local models: EmbeddingGemma-300M or Qwen3-Embedding-0.6B (GGUF via node-llama-cpp)
  • Query prompt template: "task: search result | query: {text}"

Chunking (sophisticated, markdown-aware):

  • Target ~900 tokens, 15% overlap
  • Boundary scoring: headings (50–100 pts), code fences (80 pts), paragraph breaks (20 pts)
  • Squared-distance decay within ±200-token window
  • Code fences never split mid-block
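A sketch of how that boundary scoring might look in Go. The point values follow the description above; the exact decay formula is an assumption (qmd only documents "squared-distance decay"), so treat the shape as illustrative:

```go
package main

// boundaryScore rates a candidate split point: a base score per boundary
// type, decayed with squared distance (in tokens) from the target chunk end,
// inside a ±200-token search window.
func boundaryScore(kind string, distTokens int) float64 {
	base := map[string]float64{
		"heading":   100, // top-level heading; deeper headings would score 50–100
		"codefence": 80,
		"paragraph": 20,
	}[kind]
	if distTokens < -200 || distTokens > 200 {
		return 0 // boundary outside the search window
	}
	d := float64(distTokens) / 200.0
	return base * (1 - d*d) // squared-distance decay (assumed form)
}
```

The chunker would pick the highest-scoring boundary near the ~900-token target, so a nearby heading beats an exactly-placed paragraph break.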

Full pipeline:

Query → LLM query expansion (3 variants)
      → BM25 + vector search per variant
      → RRF (k=60)
      → Top 30 candidates
      → Cross-encoder reranker (Qwen3-Reranker-0.6B)
      → Position-aware blend
      → Final results
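The RRF fusion step in that pipeline is small enough to show in full; this is the standard formulation with k = 60, not qmd's actual code:

```go
package main

import "sort"

// rrfFuse merges ranked result lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1/(k + rank_d), where rank is 1-based.
func rrfFuse(lists [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for rank, doc := range list {
			scores[doc] += 1.0 / (k + float64(rank+1))
		}
	}
	docs := make([]string, 0, len(scores))
	for d := range scores {
		docs = append(docs, d)
	}
	sort.Slice(docs, func(i, j int) bool {
		if scores[docs[i]] != scores[docs[j]] {
			return scores[docs[i]] > scores[docs[j]]
		}
		return docs[i] < docs[j] // deterministic tie-break
	})
	return docs
}
```

RRF needs only ranks, never raw scores, which is why it can fuse BM25 and cosine similarity without any score normalization.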

Applicable to trip2g: sqlite-vec, markdown-aware chunking, RRF fusion. Query expansion and cross-encoder only if latency allows.


tomohiro-owada/devrag (Go + CGO)

Token-efficient RAG for Claude Code. Pure vector search, no BM25. Written in Go — directly portable.

  • multilingual-e5-small (384-dim) via ONNX Runtime in-process
  • sqlite-vec for storage and cosine search
  • Clean Embedder interface (Embed / EmbedBatch / Close) — swappable: mock uses deterministic SHA256-based vectors for unit tests
  • Simple fixed-size 500-char chunking (no markdown awareness — upgrade this)
  • Incremental indexing with glob config

Key Go libraries:

  • mattn/go-sqlite3 / modernc.org/sqlite: SQLite driver (already in use)
  • asg017/sqlite-vec: vector similarity extension
  • yalue/onnxruntime_go: in-process ONNX inference (requires CGO)
  • FTS5: full-text search, built into SQLite

Note (trip2g): trip2g uses modernc.org/sqlite (pure Go, no CGO). sqlite-vec requires CGO and loadable extensions — incompatible with the current driver. Switching to mattn/go-sqlite3 breaks pure-Go builds and complicates cross-compilation. Not worth it at current scale (~300 notes).


Combining both repos + research findings:

Storage

Use sqlite-vec — same SQLite file, no new infrastructure. FTS5 already in SQLite for BM25.

Chunking

Split at heading boundaries, protect code fences, target 600–900 tokens, 10–15% overlap. Prepend note title + H1 to each chunk before embedding.

Retrieval

query
  → SQLite FTS5 BM25                ← already implemented (Bleve → migrate to FTS5)
  → sqlite-vec cosine similarity    ← new
  → RRF(k=60) fusion                ← already implemented
  → top-N results (no fixed threshold, or very low ~0.25)

Embedding model options

  • Hosted: OpenAI text-embedding-3-small (1536-dim) — current approach, simpler
  • Local/free (future): multilingual-e5-small ONNX (384-dim, ~50 MB) via yalue/onnxruntime_go — no API cost, better Russian quality; requires CGO

Use the swappable Embedder interface so both are interchangeable.
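A sketch of that interface plus a deterministic mock in the spirit of devrag's SHA256-based test embedder (the method set matches the Embed / EmbedBatch / Close description above; the digest-to-vector mapping is an invented detail):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"math"
)

// Embedder is the swappable interface: the OpenAI client and a local ONNX
// runner would both satisfy it.
type Embedder interface {
	Embed(text string) ([]float32, error)
	EmbedBatch(texts []string) ([][]float32, error)
	Close() error
}

// MockEmbedder derives deterministic unit vectors from SHA-256 of the input:
// stable and repeatable, so unit tests need no model or network.
type MockEmbedder struct{ Dim int }

func (m MockEmbedder) Embed(text string) ([]float32, error) {
	h := sha256.Sum256([]byte(text))
	v := make([]float32, m.Dim)
	var norm float64
	for i := range v {
		// Reuse the 32-byte digest cyclically to fill Dim dimensions.
		bits := binary.LittleEndian.Uint32(h[(i*4)%28:])
		v[i] = float32(bits%1000)/1000.0 - 0.5
		norm += float64(v[i]) * float64(v[i])
	}
	n := float32(math.Sqrt(norm))
	for i := range v {
		v[i] /= n
	}
	return v, nil
}

func (m MockEmbedder) EmbedBatch(texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i, t := range texts {
		out[i], _ = m.Embed(t)
	}
	return out, nil
}

func (m MockEmbedder) Close() error { return nil }
```

With this split, swapping OpenAI for a local ONNX model is a constructor change, and the indexing/search code never knows which backend produced the vectors.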

Short query fix

For queries < 5 words: HyDE (one GPT-4o-mini call to generate hypothetical passage, embed that instead of the raw query).


Sources