# Vector Search Research
## Summary of Findings
Research on best practices for semantic search with text-embedding-3-small over an Obsidian Markdown vault.
## 1. Chunking
Should you chunk? Yes, always. Embedding an entire note (2000–10000 tokens) degrades retrieval quality because the embedding averages over too many topics. Even short notes benefit from consistent chunking.
Recommended chunk size for Markdown: 400–512 tokens with 50–100 tokens overlap (10–20%). Use a semantic/recursive splitter on `["\n\n", "\n", ". ", " "]` — this respects heading and paragraph boundaries. Fixed-size splitting is worse.
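A minimal sketch of that recursive splitter in Go, assuming character length as a stand-in for token count (roughly 4 characters per token; a real implementation would measure with the model's tokenizer):

```go
package main

import "strings"

// Separator hierarchy from the recommendation: paragraph break, line break,
// sentence boundary, word boundary.
var separators = []string{"\n\n", "\n", ". ", " "}

// recursiveSplit returns pieces of at most maxLen characters. It splits on
// the strongest separator first and recurses with weaker separators only for
// pieces that are still too large, so paragraph boundaries win when possible.
func recursiveSplit(text string, maxLen int, seps []string) []string {
	if len(text) <= maxLen {
		return []string{text}
	}
	if len(seps) == 0 {
		// Last resort: hard cut (a real splitter would cut at rune/token edges).
		var out []string
		for len(text) > maxLen {
			out = append(out, text[:maxLen])
			text = text[maxLen:]
		}
		return append(out, text)
	}
	var out []string
	cur := ""
	for _, part := range strings.Split(text, seps[0]) {
		joined := part
		if cur != "" {
			joined = cur + seps[0] + part
		}
		switch {
		case len(joined) <= maxLen:
			cur = joined
		case len(part) <= maxLen:
			out = append(out, cur)
			cur = part
		default:
			if cur != "" {
				out = append(out, cur)
			}
			out = append(out, recursiveSplit(part, maxLen, seps[1:])...)
			cur = ""
		}
	}
	if cur != "" {
		out = append(out, cur)
	}
	return out
}
```

With maxLen set to roughly 4 × 400–512, this approximates the recommended token target while preserving Markdown structure; overlap would be layered on top of these pieces.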
Parent Document Retriever pattern (dominant in production RAG):
- Embed and index only small child chunks (200–512 tokens) for vector search
- Store the full note keyed by `doc_id` metadata on every child chunk
- On query: find top-K matching child chunks → fetch the full parent note per `doc_id`
- Gives precise retrieval (small chunks match better) + full context for display
- Natural parent boundary for Obsidian: the full note file
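The pattern above can be sketched with a hypothetical in-memory index (a real implementation would query sqlite-vec for the top children and load parent notes from storage by `doc_id`):

```go
package main

import (
	"math"
	"sort"
)

// Chunk is the small child unit that gets embedded; DocID points back to
// the full parent note.
type Chunk struct {
	DocID string
	Vec   []float32
}

func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// retrieveParents ranks child chunks against the query vector, then maps
// the best matches to deduplicated parent notes in rank order.
func retrieveParents(query []float32, chunks []Chunk, notes map[string]string, k int) []string {
	sort.SliceStable(chunks, func(i, j int) bool {
		return cosine(query, chunks[i].Vec) > cosine(query, chunks[j].Vec)
	})
	seen := map[string]bool{}
	var parents []string
	for _, c := range chunks {
		if len(parents) == k {
			break
		}
		if !seen[c.DocID] {
			seen[c.DocID] = true
			parents = append(parents, notes[c.DocID])
		}
	}
	return parents
}
```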
## 2. Short Query vs Long Document Problem
Single-word query ("space") vs long document gives low cosine similarity (~0.2–0.3) even for relevant pairs. This is expected — the query is a sparse point, the document is a centroid over many concepts.
Fixes by implementation cost:
| Technique | How | Cost | Gain |
|---|---|---|---|
| Prepend title to chunk | Add note title + H1 to each chunk before embedding | Zero | Low-medium |
| Query expansion (LLM) | Rewrite short query into 2–3 sentences before embedding | +1 LLM call | High |
| HyDE | Generate hypothetical answer doc, embed that | +1 LLM call | High |
| Multi-query | 3–5 query variants, union results | +1 LLM call | Medium |
| BM25 hybrid (already done) | FTS + vector + RRF | Already implemented | High for keyword queries |
Easiest quick win: prepend note title and H1 heading to every chunk text at index time. Zero cost.
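The quick win in code, as a hypothetical helper (assumes the note title and H1 heading are already extracted at parse time):

```go
package main

// embedText builds the string that actually gets embedded for a chunk:
// the note title and H1 heading are prepended so that short queries can
// match on them even when the chunk body uses different wording.
func embedText(title, h1, chunk string) string {
	return title + "\n" + h1 + "\n\n" + chunk
}
```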
HyDE for short queries: when query length < 5 words, use GPT-4o-mini to generate a hypothetical passage that would answer the question, then embed that passage. Document-to-document similarity is much higher than query-to-document.
## 3. HyDE (Hypothetical Document Embeddings)
- Take the user query
- Prompt LLM: "Write a short passage that would answer: [query]"
- Embed the generated hypothetical document
- Search corpus with that embedding
- Discard hypothetical document — never shown to user
Why it helps: the hypothetical doc lives in the same dense region as real documents. Particularly strong for multilingual retrieval. Outperforms BM25 and unsupervised dense retrieval on benchmark tasks.
When it hurts: LLM hallucination in domain-specific/private vaults can drift the embedding away from the real corpus.
Verdict: implement as optional path for short queries (< 5 words).
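The optional HyDE path can be sketched with injected callbacks so the backend stays swappable (assumption: the real calls would go to GPT-4o-mini and text-embedding-3-small):

```go
package main

import "strings"

// hydeEmbed returns the vector to search with. Short queries (< 5 words)
// are replaced by a hypothetical answer passage before embedding; the
// passage itself is discarded and never shown to the user.
func hydeEmbed(query string, llm func(string) string, embed func(string) []float32) []float32 {
	if len(strings.Fields(query)) >= 5 {
		return embed(query) // long queries go straight to the embedder
	}
	passage := llm("Write a short passage that would answer: " + query)
	return embed(passage)
}
```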
## 4. Similarity Threshold Calibration
text-embedding-3-small returns lower scores than ada-002 (~5–10% lower). Community rule of thumb: start at 0.25–0.30, not 0.70+.
A fixed threshold is a bad idea as the sole gate. Two failure modes:
- Too high → zero results for valid queries
- Too low → noise floods results
Practical approach:
- Start with top-K (K=5–10) without threshold during development
- Sample 50–100 relevant + 50–100 irrelevant query/note pairs from your vault
- Plot score distribution, set threshold at the natural gap
- Hybrid gate in production: `score >= threshold OR rank <= 3` — always return at least 1 result
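The hybrid gate as a function (sketch; hits are assumed already sorted by descending score):

```go
package main

// Hit is one scored search result.
type Hit struct {
	Score float64
	ID    string
}

// gate keeps a hit if it clears the calibrated threshold OR sits in the
// top 3 by rank, so a valid query never comes back empty.
func gate(hits []Hit, threshold float64) []Hit {
	var kept []Hit
	for rank, h := range hits {
		if h.Score >= threshold || rank < 3 {
			kept = append(kept, h)
		}
	}
	return kept
}
```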
Expected similarity values for text-embedding-3-small (Russian):
| Pair type | Expected cosine similarity |
|---|---|
| Highly relevant (same topic, similar phrasing) | 0.55–0.75 |
| Relevant (same topic, different phrasing) | 0.35–0.55 |
| Tangentially related | 0.20–0.35 |
| Unrelated | 0.10–0.25 |
| Short 1–3 word query vs full document | 0.15–0.30 even for relevant pairs |
Russian scores ~5–10% lower than English equivalents (training data is English-dominant).
## 5. Asymmetric Search — Better Models for Russian
text-embedding-3-small is symmetric (single encoder for query and document). Dedicated asymmetric models are better for retrieval.
Best open-source alternatives for Russian (ruMTEB benchmark):
| Model | Notes |
|---|---|
| multilingual-e5-large-instruct | Best on ruMTEB; use `query:` prefix for queries, `passage:` for docs |
| BGE-M3 | Best on Russian retrieval specifically; supports dense + sparse; 100+ languages |
| ru-en-RoSBERTa | Bilingual RU+EN; competitive at lower cost |
Instruction-tuned models (prompted with an instruction such as "Retrieve passages that answer the question") directly address the asymmetric query/document problem without needing HyDE.
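For the e5 family, the asymmetry is handled by prefixing the two sides differently at embed time (per the documented e5 usage convention):

```go
package main

// e5 models expect distinct text prefixes on the two sides of retrieval:
// "query: " for the search query, "passage: " for the indexed document.
func e5Query(q string) string   { return "query: " + q }
func e5Passage(p string) string { return "passage: " + p }
```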
Verdict: text-embedding-3-small is pragmatic if already using the OpenAI API. Upgrade to BGE-M3 or multilingual-e5-large-instruct only if measured recall is unsatisfactory.
## 6. Implementation Roadmap (Ordered by ROI)
1. Top-K without threshold — remove the fixed threshold during development, inspect real score distributions
2. Prepend note title to every chunk — zero cost, improves short-query recall
3. Chunking (400–512 tokens, Markdown boundaries) + Parent Document Retriever pattern
4. HyDE for short queries (< 5 words) — one GPT-4o-mini call
5. Threshold from data — calibrate from vault-specific query/note pairs
6. BGE-M3 or multilingual-e5-large-instruct if Russian quality is insufficient
## 7. Reference Implementations
### tobi/qmd (TypeScript, local-only)
Offline semantic search over Markdown via MCP. Dual-model hybrid:
- SQLite FTS5 (BM25) + sqlite-vec cosine similarity
- Local models: EmbeddingGemma-300M or Qwen3-Embedding-0.6B (GGUF via node-llama-cpp)
- Query prompt template: `task: search result | query: {text}`
Chunking (sophisticated, markdown-aware):
- Target ~900 tokens, 15% overlap
- Boundary scoring: headings (50–100 pts), code fences (80 pts), paragraph breaks (20 pts)
- Squared-distance decay within ±200-token window
- Code fences never split mid-block
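One reading of that boundary scoring, as a sketch (the base point values come from the description above; the exact decay formula is an assumption):

```go
package main

// boundaryScore weights a candidate split point: a type-dependent base
// score (heading 50–100, code fence 80, paragraph break 20) decayed by
// squared distance from the target chunk end, within a ±window of tokens.
func boundaryScore(base float64, pos, target, window int) float64 {
	d := pos - target
	if d < 0 {
		d = -d
	}
	if d > window {
		return 0 // outside the window, this boundary is not considered
	}
	frac := float64(d) / float64(window)
	return base * (1 - frac*frac)
}
```

The chunker would then cut at the highest-scoring boundary near each ~900-token target, never inside a code fence.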
Full pipeline:
Query → LLM query expansion (3 variants)
→ BM25 + vector search per variant
→ RRF (k=60)
→ Top 30 candidates
→ Cross-encoder reranker (Qwen3-Reranker-0.6B)
→ Position-aware blend
→ Final results
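The RRF step in that pipeline is small enough to sketch directly (k=60 as above; standard reciprocal-rank-fusion formula):

```go
package main

import "sort"

// rrf fuses several ranked result lists: each document accumulates
// 1/(k + rank) per list it appears in (rank starting at 1), and documents
// are returned in order of descending fused score.
func rrf(lists [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	fused := make([]string, 0, len(scores))
	for id := range scores {
		fused = append(fused, id)
	}
	sort.Slice(fused, func(a, b int) bool { return scores[fused[a]] > scores[fused[b]] })
	return fused
}
```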
Applicable to trip2g: sqlite-vec, markdown-aware chunking, RRF fusion. Query expansion and cross-encoder only if latency allows.
### tomohiro-owada/devrag (Go + CGO)
Token-efficient RAG for Claude Code. Pure vector search, no BM25. Written in Go — directly portable.
- multilingual-e5-small (384-dim) via ONNX Runtime in-process
- sqlite-vec for storage and cosine search
- Clean `Embedder` interface (Embed / EmbedBatch / Close) — swappable: the mock uses deterministic SHA256-based vectors for unit tests
- Simple fixed-size 500-char chunking (no markdown awareness — upgrade this)
- Incremental indexing with glob config
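The swappable interface plus its deterministic test mock, sketched in Go (type and method names are assumptions modeled on the description above):

```go
package main

import "crypto/sha256"

// Embedder is the swappable abstraction: a real implementation wraps ONNX
// Runtime or a hosted API; the mock below needs no model at all.
type Embedder interface {
	Embed(text string) ([]float32, error)
	EmbedBatch(texts []string) ([][]float32, error)
	Close() error
}

// mockEmbedder derives a deterministic vector from the SHA-256 digest of
// the text, so unit tests are reproducible without running inference.
type mockEmbedder struct{ dim int }

func (m mockEmbedder) Embed(text string) ([]float32, error) {
	sum := sha256.Sum256([]byte(text))
	vec := make([]float32, m.dim)
	for i := range vec {
		vec[i] = float32(sum[i%len(sum)]) / 255.0
	}
	return vec, nil
}

func (m mockEmbedder) EmbedBatch(texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i, t := range texts {
		v, err := m.Embed(t)
		if err != nil {
			return nil, err
		}
		out[i] = v
	}
	return out, nil
}

func (m mockEmbedder) Close() error { return nil }
```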
Key Go libraries:
| Library | Purpose |
|---|---|
| mattn/go-sqlite3 / modernc.org/sqlite | SQLite driver (already in use) |
| asg017/sqlite-vec | Vector similarity extension |
| yalue/onnxruntime_go | In-process ONNX inference (CGO) |
| FTS5 | Built into SQLite |
Note (trip2g): trip2g uses `modernc.org/sqlite` (pure Go, no CGO). sqlite-vec requires CGO and loadable extensions — incompatible with the current driver. Switching to `mattn/go-sqlite3` breaks pure-Go builds and complicates cross-compilation. Not worth it at the current scale (~300 notes).
## 8. Recommended Architecture for trip2g
Combining both repos + research findings:
### Storage
Use sqlite-vec — same SQLite file, no new infrastructure. FTS5 already in SQLite for BM25.
### Chunking
Split at heading boundaries, protect code fences, target 600–900 tokens, 10–15% overlap. Prepend note title + H1 to each chunk before embedding.
### Retrieval
query
→ SQLite FTS5 BM25 ← already implemented (Bleve → migrate to FTS5)
→ sqlite-vec cosine similarity ← new
→ RRF(k=60) fusion ← already implemented
→ top-N results (no fixed threshold, or very low ~0.25)
### Embedding model options
- Hosted: OpenAI `text-embedding-3-small` (1536-dim) — current approach, simpler
- Local/free (future): `multilingual-e5-small` ONNX (384-dim, 50MB) via `onnxruntime_go` — no API cost, better Russian quality; requires CGO
Use the swappable `Embedder` interface so both backends are interchangeable.
### Short query fix
For queries < 5 words: HyDE (one GPT-4o-mini call to generate hypothetical passage, embed that instead of the raw query).