Search Architecture

Trip2g uses hybrid search: full-text search (FTS) via Bleve combined with vector semantic search via OpenAI embeddings, merged with Reciprocal Rank Fusion (RRF).

Components

Full-Text Search (Bleve)

  • In-memory index rebuilt on every vault reload
  • Russian morphology analyzer (ru) for stemming
  • AND operator — all query terms must appear in the document
  • Indexes: note title + extracted plain text (frontmatter stripped, code blocks included)
  • Result: highlighted title and body snippets

Entry point: internal/noteloader/search.go (buildSearchIndex, Search).

Vector Search

  • Embeddings generated by OpenAI (text-embedding-3-small, 1536 dims) via background job
  • Stored in note_version_embeddings table (BLOB, float32 LE)
  • Loaded into model.NoteView.Embedding at vault load time via assignEmbeddings
  • At query time: query text → OpenAI embedding → cosine similarity against all notes with embeddings
  • Threshold: vectorMinSimilarity = 0.40 (notes below threshold are excluded)
  • Top 30 candidates passed to RRF merger

Entry point: internal/case/sitesearch/resolve.go (vectorSearch).
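
The query-time steps above can be sketched in Go as follows. This is a minimal illustration, not the actual vectorSearch code: the scoredNote type, the rankBySimilarity function, and the map-based note store are hypothetical, and the query embedding is assumed to have been fetched from OpenAI already. Only the 0.40 threshold and the top-30 cap come from the description above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Values quoted in the section above.
const (
	vectorMinSimilarity = 0.40
	maxVectorCandidates = 30
)

// scoredNote is a hypothetical result type used only in this sketch.
type scoredNote struct {
	Path       string
	Similarity float64
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// are empty, differ in length, or either one has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// rankBySimilarity scores every note embedding against the query, drops
// notes below the threshold, and returns at most maxVectorCandidates
// results, best first.
func rankBySimilarity(query []float32, notes map[string][]float32) []scoredNote {
	var out []scoredNote
	for path, emb := range notes {
		if sim := cosine(query, emb); sim >= vectorMinSimilarity {
			out = append(out, scoredNote{Path: path, Similarity: sim})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Similarity != out[j].Similarity {
			return out[i].Similarity > out[j].Similarity
		}
		return out[i].Path < out[j].Path // deterministic order for ties
	})
	if len(out) > maxVectorCandidates {
		out = out[:maxVectorCandidates]
	}
	return out
}

func main() {
	query := []float32{1, 0}
	notes := map[string][]float32{
		"a.md": {1, 0},
		"b.md": {1, 1},
		"c.md": {0, 1}, // orthogonal to the query, falls below the threshold
	}
	for _, n := range rankBySimilarity(query, notes) {
		fmt.Printf("%s %.3f\n", n.Path, n.Similarity)
	}
}
```

Because every note with an embedding is scored, the cost grows linearly with vault size; that is usually acceptable for an in-memory vault but worth keeping in mind.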

Hybrid Merge (RRF)

Reciprocal Rank Fusion combines FTS and vector results:

score(doc) = Σ 1 / (k + rank)   summed over each list where doc appears
rank = position of doc in that list
k = 60 (standard RRF constant)

  • Rank-based, not score-based — avoids normalization issues between BM25 and cosine
  • Vector-only results get a generated text snippet (first 150 chars, frontmatter skipped)
  • Final list capped at 20 results
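
As a rough illustration of the merge, the sketch below fuses two ranked lists of note paths. The fuse function and constant names are hypothetical; k = 60 and the cap of 20 come from the text above. Two assumptions are made here: ranks are 1-based (the top item contributes 1/61), and a path appears at most once per list.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	rrfK       = 60 // standard RRF constant
	maxResults = 20 // cap on the final list
)

// fuse merges ranked lists of note paths with Reciprocal Rank Fusion.
// Assumption: ranks are 1-based, so position i (0-based) scores 1/(rrfK+i+1).
func fuse(lists ...[]string) []string {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / float64(rrfK+i+1)
		}
	}

	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // deterministic order for equal scores
	})

	if len(ids) > maxResults {
		ids = ids[:maxResults]
	}
	return ids
}

func main() {
	fts := []string{"a.md", "b.md", "c.md"}
	vec := []string{"b.md", "d.md"}
	fmt.Println(fuse(fts, vec))
}
```

In this example b.md ranks first because it appears in both lists (1/62 + 1/61), ahead of a.md, which tops only the FTS list (1/61).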

Feature Flag

Vector search requires:

  1. features.VectorSearch.Enabled = true (config)
  2. OpenAI API key configured

If either condition is unmet, only FTS runs. FTS always runs, regardless of these settings.
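
The gating logic amounts to a two-part check. The sketch below uses a hypothetical Config struct and searchModes function purely for illustration; the real configuration type behind features.VectorSearch.Enabled is not shown in this document.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the relevant settings.
type Config struct {
	VectorSearchEnabled bool   // mirrors features.VectorSearch.Enabled
	OpenAIAPIKey        string // empty when no key is configured
}

// searchModes lists the search backends that run for a query.
// FTS is always included; vector search needs both conditions.
func searchModes(cfg Config) []string {
	modes := []string{"fts"}
	if cfg.VectorSearchEnabled && cfg.OpenAIAPIKey != "" {
		modes = append(modes, "vector")
	}
	return modes
}

func main() {
	fmt.Println(searchModes(Config{VectorSearchEnabled: true}))
	fmt.Println(searchModes(Config{VectorSearchEnabled: true, OpenAIAPIKey: "sk-example"}))
}
```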

Embedding Generation

The background job generatenoteversionembedding runs whenever a note version is created:

  1. Fetches note from in-memory cache
  2. Computes sha256(title + raw_content) — skips if embedding already up to date
  3. Sends title + "\n\n" + raw_content to OpenAI (includes frontmatter YAML)
  4. Stores float32 LE bytes + model_id + content_hash in note_version_embeddings

Re-generation job: internal/case/cronjob/regeneratenoteembeddings.
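
Steps 2 to 4 can be approximated with the standard library alone. The sketch below is illustrative: the function names are hypothetical, hex encoding of the hash is an assumption (the document does not say how content_hash is stored), and the OpenAI call itself is omitted.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math"
)

// contentHash hashes title + raw content (step 2).
// Assumption: the hash is stored as a hex string.
func contentHash(title, rawContent string) string {
	sum := sha256.Sum256([]byte(title + rawContent))
	return hex.EncodeToString(sum[:])
}

// embeddingInput builds the text sent to the embedding model (step 3).
func embeddingInput(title, rawContent string) string {
	return title + "\n\n" + rawContent
}

// encodeEmbedding serializes a vector as little-endian float32 bytes (step 4).
func encodeEmbedding(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// decodeEmbedding reverses encodeEmbedding, as needed at vault load time.
func decodeEmbedding(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, fmt.Errorf("embedding blob length %d is not a multiple of 4", len(b))
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}

func main() {
	blob := encodeEmbedding([]float32{1, -0.5, 0.25})
	vec, err := decodeEmbedding(blob)
	fmt.Println(len(blob), vec, err)
	fmt.Println(contentHash("Title", "body"))
}
```

With this layout a 1536-dimension embedding occupies 6144 bytes, which is why LENGTH(embedding)/4 recovers the dimension count in SQL.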

Search Exclusion

A note can be excluded from the search index by setting search: false in its frontmatter:

---
search: false
---

System notes (path starts with _ or contains /_) are also excluded from FTS indexing.
Exclusion is applied during buildSearchIndex; vector search currently does not apply this filter at query time.
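
The two exclusion rules can be expressed as a single predicate. The sketch below is hypothetical (the function names and the pointer-based representation of the frontmatter flag are assumptions); a predicate of this shape could in principle also be reused in vectorSearch if the same filtering is wanted at query time.

```go
package main

import (
	"fmt"
	"strings"
)

// isSystemNote reports whether the path starts with "_" or contains "/_".
func isSystemNote(path string) bool {
	return strings.HasPrefix(path, "_") || strings.Contains(path, "/_")
}

// excludedFromSearch combines both rules. searchFlag is the frontmatter
// "search" value, or nil when the key is absent (assumed representation).
func excludedFromSearch(path string, searchFlag *bool) bool {
	if searchFlag != nil && !*searchFlag {
		return true
	}
	return isSystemNote(path)
}

func main() {
	off := false
	fmt.Println(excludedFromSearch("_layouts/base.md", nil)) // system note
	fmt.Println(excludedFromSearch("docs/_draft.md", nil))   // system note in a subfolder
	fmt.Println(excludedFromSearch("docs/guide.md", &off))   // search: false
	fmt.Println(excludedFromSearch("docs/my_note.md", nil))  // regular note
}
```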

Testing

Manual golden set

To verify search quality, run the server and check these query → expected note pairs:

Query                                   Expected note             English gloss
как учёные ищут планеты у других звёзд  demo/search_astronomy.md  how scientists search for planets around other stars
ключевые слова теги                     demo/search_keywords.md   keywords tags

Calibrating the similarity threshold

Add temporary debug logging in vectorSearch to see real similarity values:

env.Logger().Debug("similarity", "path", note.Path, "sim", similarity)

Run a few representative queries and collect the similarity distribution:

  • Relevant notes: typical range 0.40–0.60 for Russian question→document with text-embedding-3-small
  • Noise: typically < 0.35

Set vectorMinSimilarity at the gap between the two groups.

Checking embedding freshness

-- Notes where latest version has no embedding
SELECT np.value, nv.id
FROM note_paths np
JOIN note_versions nv ON nv.path_id = np.id AND np.version_count = nv.version
LEFT JOIN note_version_embeddings nve ON nve.version_id = nv.id
WHERE nve.version_id IS NULL AND np.hidden_by IS NULL;

-- Embedding model distribution
SELECT model_id, COUNT(*) AS cnt, LENGTH(embedding)/4 AS dims
FROM note_version_embeddings
GROUP BY model_id, LENGTH(embedding)/4;