Building a Hybrid Search Knowledge Base for AI Agents

Why Hybrid Search Matters for RAG

Pure vector search is weak on exact keyword matches: ask for "Q3 Annual Performance Report" and semantic-only retrieval may return documents that are only broadly related. Pure BM25 keyword search has the opposite problem and misses conceptual similarity: a search for "revenue decline" will not find a document that discusses "decreased earnings." A hybrid Knowledge Base service addresses both by running the two searches in parallel and fusing the results with Reciprocal Rank Fusion (RRF). Better-grounded context in turn gives the LLM less reason to fill gaps with invented details.
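To make the fusion step concrete, here is a minimal sketch of RRF in plain Python. It assumes each retriever returns an ordered list of document IDs; the IDs and the function name are made up for illustration, and this is not the service's actual code, which delegates fusion to Elasticsearch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only positions matter, not raw scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for the query "Q3 revenue decline".
bm25_hits = ["q3_report", "q3_press_release", "q2_report"]
knn_hits = ["earnings_drop_memo", "q3_report", "market_outlook"]

fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
print(fused[0])  # q3_report
```

The document that both retrievers agree on rises to the top, even though neither list's raw scores were ever compared.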

Three Retrieval Modes in FastAPI

The retrieval API exposes three modes: Semantic (KNN-only, using Vertex AI embeddings), Full-Text (BM25 multi-match, with query sanitization and escaping), and Hybrid (RRF fusion of both). The hybrid mode uses Elasticsearch's native RRF retriever, which combines rank positions from both search strategies without needing to normalize scores. BM25 scores and vector similarities live on different scales, so fusing by rank avoids calibrating one against the other and leaves few parameters to tune.
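As a rough sketch of what the hybrid request could look like, the helper below builds a search body using the retriever syntax found in recent Elasticsearch 8.x releases. The function name and the field names (title, content, embedding, document_id) are assumptions for illustration, not the service's actual mapping; check the body against your cluster version before relying on it.

```python
def build_hybrid_query(text, embedding, doc_ids=None, top_k=10):
    """Build an Elasticsearch search body that fuses BM25 and kNN with RRF.

    Field names ("title", "content", "embedding", "document_id") are
    assumptions for illustration, not the service's actual mapping.
    """
    # Optional filter restricting both retrievers to the requested documents.
    doc_filter = {"terms": {"document_id": list(doc_ids)}} if doc_ids else None

    bm25 = {"query": {"multi_match": {"query": text, "fields": ["title", "content"]}}}
    knn = {
        "field": "embedding",
        "query_vector": list(embedding),
        "k": top_k,
        "num_candidates": max(top_k * 10, 100),
    }
    if doc_filter:
        bm25["filter"] = doc_filter
        knn["filter"] = doc_filter

    return {
        "retriever": {
            "rrf": {
                "retrievers": [{"standard": bm25}, {"knn": knn}],
                "rank_constant": 60,
                # The fusion window must be at least as large as the page size.
                "rank_window_size": max(top_k, 50),
            }
        },
        "size": top_k,
    }


body = build_hybrid_query("revenue decline", [0.1, 0.2, 0.3], doc_ids=["doc-1"], top_k=5)
```

Applying the same filter to both retrievers keeps the two candidate lists comparable, so a document cannot enter the fused ranking through only one side of the filter.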

python

# Simplified retrieval endpoint from src/app/entrypoints/api.py
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession

# RetrievalRequest, ElasticsearchAdapter, get_session, generate_embedding and
# enrich_with_neighbors are project-internal; their imports are omitted here.
router = APIRouter()


@router.post("/retrieve")
async def retrieve_documents(
    query: RetrievalRequest,
    es_client: ElasticsearchAdapter = Depends(),
    db: AsyncSession = Depends(get_session),
):
    # Generate the query embedding via Vertex AI
    embedding = await generate_embedding(query.text)

    # Run hybrid RRF search in Elasticsearch, optionally scoped to given documents
    results = await es_client.hybrid_search(
        query=query.text,
        embedding=embedding,
        doc_ids=query.document_ids,
        top_k=query.top_k,
    )

    # Enrich matches with surrounding text for parent-child context
    if query.include_context:
        results = await enrich_with_neighbors(db, results, query.context_window)
    return results

Hybrid Retrieval Flow

sequenceDiagram
    participant Client
    participant API as KB Retrieval API
    participant AI as Vertex AI Embeddings
    participant ES as Elasticsearch
    participant DB as PostgreSQL
    Client->>API: Hybrid Query
    API->>AI: Generate 768d Embedding
    API->>ES: RRF Search (KNN + BM25)
    Note over ES: Reciprocal Rank Fusion
    ES-->>API: Fused Ranked Chunks
    API->>DB: Fetch Neighbor Chunks (±window)
    DB-->>API: Context Chunks
    API-->>Client: Enriched Results

Neighbor Context Enrichment

A key insight: the best-matching chunk often lacks the surrounding context an LLM needs to generate a complete answer. Our service fetches neighbor chunks (the number on each side is set by a configurable context_window) from PostgreSQL based on chunk position, then merges them as [PREV] + [MAIN] + [NEXT] before returning results. This "parent-child" chunking strategy improves LLM answer quality while keeping the retrieval index focused on small, precise chunks. Because the extra context is attached only to chunks that actually matched, and only as far as the window reaches, prompt token usage stays bounded.
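The merge step can be sketched in a few lines. This version assumes the chunks of one document are available as a dictionary keyed by position, which stands in for the PostgreSQL lookup; the function name and sample text are hypothetical.

```python
def merge_with_neighbors(chunks_by_position, match_position, context_window=1):
    """Return the matched chunk's text surrounded by its neighbors.

    chunks_by_position maps a chunk's position within one document to its
    text; in the real service this lookup would be a PostgreSQL query.
    """
    parts = []
    start = match_position - context_window
    end = match_position + context_window
    for position in range(start, end + 1):
        text = chunks_by_position.get(position)
        if text is None:
            continue  # Document boundary: no neighbor on this side.
        if position < match_position:
            label = "[PREV]"
        elif position == match_position:
            label = "[MAIN]"
        else:
            label = "[NEXT]"
        parts.append(f"{label} {text}")
    return "\n".join(parts)


# Hypothetical document split into three chunks.
document = {
    0: "Revenue was flat in Q2.",
    1: "Q3 revenue fell 8%.",
    2: "Costs rose in parallel.",
}
print(merge_with_neighbors(document, match_position=1))
```

A match on the first or last chunk simply comes back with one-sided context, so callers do not need to special-case document boundaries.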

Document Ingestion Pipeline

Documents flow in asynchronously: the API receives metadata, copies the source file to GCS, publishes an ingestion task to Pub/Sub, and returns 202 Accepted with a job ID. A Cloud Run Job picks up the message, runs the document through Google DocumentAI for OCR/layout parsing, chunks the content, generates embeddings via Vertex AI, and indexes everything into Elasticsearch. The entire pipeline is event-driven and scales independently of the retrieval service, so a large ingestion batch does not hold up request handling for real-time users.
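The accept-then-process handoff at the front of this pipeline might look roughly like the sketch below. It uses only the standard library: the publish argument is any callable that takes bytes and stands in for a Pub/Sub publisher, and the message schema, function name, and bucket path are assumptions for illustration.

```python
import json
import uuid


def accept_ingestion(metadata, publish):
    """Queue a document for background processing and respond immediately.

    `publish` is any callable that takes bytes; it stands in for a Pub/Sub
    publisher here. The message schema is an assumption for illustration.
    """
    job_id = str(uuid.uuid4())
    message = {
        "job_id": job_id,
        "gcs_uri": metadata["gcs_uri"],
        "title": metadata.get("title"),
    }
    publish(json.dumps(message).encode("utf-8"))
    # 202 Accepted: the work is queued, not finished.
    return 202, {"job_id": job_id, "status": "queued"}


# Usage with an in-memory list standing in for the topic.
queue = []
status, body = accept_ingestion(
    {"gcs_uri": "gs://example-bucket/report.pdf", "title": "Q3 Report"},
    queue.append,
)
```

Returning the job ID right away lets the client poll for completion while OCR, chunking, and embedding run in the background worker.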

Lessons Learned: RRF Tuning Gotchas

One major gotcha with Reciprocal Rank Fusion is the rank constant k (60 by default in Elasticsearch). It controls how quickly a document's contribution falls off with rank, but it applies to both retrievers equally, so on its own it cannot give BM25 the higher priority over semantic similarity that a highly specialized vocabulary calls for. Rather than relying on the fused RRF score alone, we learned to filter candidates against a very low threshold before merging them, which prevents completely irrelevant semantic "near misses" from polluting the top-ranked keyword matches. In our setup, tuning this threshold reduced RAG retrieval noise by over 40%.
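One way to picture the pre-fusion filter is the sketch below, which drops semantic candidates under a similarity floor and only then applies RRF. It is an illustration under stated assumptions rather than the service's implementation: the function name, the threshold value, and the sample IDs and similarities are all hypothetical.

```python
def fuse_with_semantic_floor(bm25_ids, knn_hits, min_similarity=0.3, k=60):
    """Drop weak semantic candidates, then fuse both lists with RRF.

    bm25_ids: document IDs ordered best first.
    knn_hits: (doc_id, similarity) pairs ordered best first.
    min_similarity and its default are hypothetical; tune on your own data.
    """
    # Filtering first means a semantic near miss never earns an RRF share.
    knn_ids = [doc_id for doc_id, sim in knn_hits if sim >= min_similarity]

    scores = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ids = ["sku_4471_spec", "sku_4471_faq"]
knn_hits = [("sku_4471_spec", 0.82), ("generic_catalog", 0.12)]
print(fuse_with_semantic_floor(bm25_ids, knn_hits))
# ['sku_4471_spec', 'sku_4471_faq']
```

Without the floor, generic_catalog would receive the same rank-2 contribution as sku_4471_faq, because RRF looks only at position and never at how weak the underlying similarity was.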
