UB Vector Embeddings

Benchmarked against: Anthropic
Model: Gemini embedding-001 (768 dimensions)
Infrastructure: Cloudflare Vectorize

Vector embeddings power the semantic search in Cloud UB. When text is ingested, it's converted to a numerical vector that captures its meaning. Similar content produces similar vectors, enabling "find things like this" searches.


How embeddings work in UB

Embedding pipeline

  1. Ingestion: Content → Gemini embedding-001 → 768-dimensional vector → stored in Vectorize
  2. Search: Query → same embedding model → query vector → cosine similarity against all stored vectors
  3. Ranking: Combine vector similarity score with FTS5 keyword score → hybrid ranked results
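The similarity computation in step 2 can be sketched as a plain cosine calculation. Vectorize performs this internally; the function below is illustrative only:

```typescript
// Cosine similarity between two equal-length vectors, as used in step 2 of
// the pipeline. Shown only to illustrate the math Vectorize runs internally.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why "find things like this" reduces to "find the stored vectors with the highest cosine score against the query vector."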

Embedding model details

| Property | Value |
| --- | --- |
| Provider | Google (Gemini) |
| Model | embedding-001 |
| Dimensions | 768 |
| Max input | ~2,048 tokens per embedding |
| Language support | English (excellent), Chinese (good) |
| Cost | Included in Gemini API pricing (very low) |
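A request to this model follows the public Gemini REST API's `embedContent` shape. The payload builder below is a sketch, not UB's actual ingestion code, and the character-based truncation heuristic is an assumption:

```typescript
// Sketch of an embedContent request body for the public Gemini REST API.
// Not UB's actual ingestion code; the truncation heuristic is an assumption.
const GEMINI_MODEL = "models/embedding-001"; // 768-dimensional output

interface EmbedRequest {
  model: string;
  content: { parts: { text: string }[] };
}

function buildEmbedRequest(text: string): EmbedRequest {
  // Respect the ~2,048-token input limit defensively,
  // using a rough ~4 characters-per-token heuristic.
  const truncated = text.slice(0, 2048 * 4);
  return { model: GEMINI_MODEL, content: { parts: [{ text: truncated }] } };
}
```

The body is POSTed to the model's `:embedContent` endpoint, and the vector comes back as an array of 768 numbers under `embedding.values`.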

Why Gemini embedding-001?

| Factor | Decision |
| --- | --- |
| Cost | Very low — pennies per thousand embeddings |
| Quality | Good multilingual support |
| Integration | Native Cloudflare Vectorize support |
| Dimensions | 768 — good balance of quality vs. storage |

Vectorize infrastructure

| Property | Value |
| --- | --- |
| Service | Cloudflare Vectorize |
| Index type | HNSW (approximate nearest neighbor) |
| Similarity metric | Cosine similarity |
| Free tier | 30M queries/month |
| Storage | Part of Workers infrastructure |

Hybrid search scoring

Search results combine two scores:

| Source | Weight | What it finds |
| --- | --- | --- |
| FTS5 keyword score | Varies | Exact term matches |
| Vector similarity score | Varies | Semantically similar content |

The combined score determines the final ranking. Entries that match on both keyword and semantic similarity score highest.
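The combination can be sketched as a normalized weighted sum. The real weights vary (as the table notes); the 0.5/0.5 split below is an assumption for illustration:

```typescript
// Illustrative hybrid ranking. The actual UB weights vary; the equal
// 0.5/0.5 split here is an assumption for the example.
interface Scored { id: string; keywordScore: number; vectorScore: number }

function hybridRank(
  entries: Scored[],
  keywordWeight = 0.5,
  vectorWeight = 0.5,
): Scored[] {
  // Normalize each score column to [0, 1] so FTS5 rank values and
  // cosine similarities are comparable before mixing.
  const norm = (xs: number[]) => {
    const max = Math.max(...xs, 1e-9);
    return xs.map((x) => x / max);
  };
  const kw = norm(entries.map((e) => e.keywordScore));
  const vec = norm(entries.map((e) => e.vectorScore));
  return entries
    .map((e, i) => ({ e, score: keywordWeight * kw[i] + vectorWeight * vec[i] }))
    .sort((a, b) => b.score - a.score)
    .map((x) => x.e);
}
```

An entry that scores well on both columns outranks one that dominates only a single column, which is the behavior the hybrid design is after.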


Embedding quality considerations

| Factor | Impact |
| --- | --- |
| Content length | Short entries may have weak embeddings — less semantic signal |
| Content language | English content embeds better than Chinese |
| Content specificity | Specific content embeds better than generic |
| Duplicate content | Near-duplicates cluster together — may reduce search diversity |

Best practices for embeddable content

| Do | Don't |
| --- | --- |
| Write descriptive titles | Use vague titles ("Notes") |
| Include key terms in content | Rely only on tags for discoverability |
| Keep entries self-contained | Write entries that depend on context |
| Use English for all UB content | Mix languages within entries |

Regeneration

Vector embeddings are regenerable — they can be rebuilt from entry content:

| Scenario | Action |
| --- | --- |
| Vectorize index corruption | Re-embed all entries from D1 |
| Embedding model upgrade | Re-embed with new model |
| Content update | Entry re-embedded automatically |

This makes embeddings a low-criticality backup item — they don't need to be backed up separately.
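A rebuild pass reduces to a loop over the entries stored in D1. In the sketch below, `fetchEntries`, `embed`, and `upsert` are hypothetical stand-ins for the D1 query, the Gemini call, and the Vectorize write — not UB's actual bindings:

```typescript
// Hypothetical re-embedding pass. The injected function names and
// signatures are stand-ins, not UB's actual bindings.
interface Entry { id: string; content: string }

async function reembedAll(
  fetchEntries: () => Promise<Entry[]>,                      // stand-in for a D1 query
  embed: (text: string) => Promise<number[]>,                // stand-in for the Gemini call
  upsert: (id: string, vector: number[]) => Promise<void>,   // stand-in for the Vectorize write
): Promise<number> {
  const entries = await fetchEntries();
  for (const entry of entries) {
    const vector = await embed(entry.content);
    await upsert(entry.id, vector);
  }
  return entries.length; // number of vectors rebuilt
}
```

Because the loop is driven entirely by D1 content, losing the Vectorize index never loses information — only the time it takes to re-run this pass.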


| Page | Relationship |
| --- | --- |
| Search Brain | How search uses embeddings |
| Cloud UB MCP | Vectorize infrastructure |
| Engine Migration | Embedding model migration |
| Multilingual Support | Language considerations |