
Optimizing RAG Pipelines: Reducing Latency and Hallucinations in Production

 Moving a Retrieval-Augmented Generation (RAG) prototype from a Jupyter notebook to a production environment is where the real engineering begins. In a controlled environment with 50 documents, a basic top_k vector search works perfectly. In production with 500,000 chunks, two critical issues emerge: retrieval latency spikes, and the LLM begins to "hallucinate" answers because the retrieved context—while mathematically "close" in vector space—is semantically irrelevant to the specific user query.

This post details how to implement L2 normalization for retrieval speed and a cross-encoder re-ranking pipeline to eliminate context noise.

The Root Cause: High-Dimensional Noise and Index Traversal

To fix RAG, you must understand why it fails at scale.

  1. Latency & The Dot Product Shortcut: Most vector databases score results with Cosine Similarity. However, computing vector magnitudes at query time adds work to every comparison. Many production systems default to Dot Product operations because they are faster, but Dot Product only equals Cosine Similarity if the vectors are normalized (have a unit length of 1). If you aren't strictly normalizing embeddings before ingestion, your database is either wasting cycles normalizing on the fly or returning inaccurate rankings driven by vector magnitude rather than direction (a quick numerical check of this equivalence follows this list).
  2. The "Nearest Neighbor" Fallacy: Vector search algorithms (like HNSW) are approximate. They return the nearest neighbors, even if those neighbors are garbage. If a user asks a question about a topic not in your database, the vector search will still return the "closest" irrelevant chunks. The LLM, forced to use this context, will often hallucinate an answer to bridge the gap between the query and the provided text.
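A quick way to verify the first point: once embeddings are unit length, a plain dot product and cosine similarity return identical scores. Here is a minimal standalone check with numpy (independent of the service class built later in this post):

import numpy as np

a = np.random.rand(384)
b = np.random.rand(384)

# Cosine similarity: dot product divided by both magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-normalize to unit length, then a plain dot product is enough
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_normalized = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_normalized))  # True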

The Fix: Normalization, Filtering, and Cross-Encoder Re-ranking

We will implement a robust retrieval pipeline using Python, sentence-transformers, and Qdrant (as the vector engine). This solution implements:

  1. L2 Normalization: Ensuring fast Dot Product retrieval.
  2. Metadata Filtering: Reducing the search space immediately.
  3. Cross-Encoder Re-ranking: A second-pass filter to reject irrelevant context.

Prerequisites

pip install qdrant-client sentence-transformers numpy pydantic

The Implementation

This code represents a complete, modular service class for handling ingestion and retrieval.

import numpy as np
from typing import List, Dict, Optional
from pydantic import BaseModel, Field
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer, CrossEncoder

# Configuration
VECTOR_SIZE = 384
COLLECTION_NAME = "production_rag_v1"
EMBEDDING_MODEL_ID = "all-MiniLM-L6-v2"  # Fast Bi-Encoder
RERANKER_MODEL_ID = "cross-encoder/ms-marco-MiniLM-L-6-v2" # Accurate Cross-Encoder

class DocumentChunk(BaseModel):
    content: str
    metadata: Dict[str, str] = Field(default_factory=dict)
    
class SearchResult(BaseModel):
    content: str
    score: float
    metadata: Dict[str, str]

class RAGService:
    def __init__(self):
        # Initialize Vector DB (In-memory for demo, use URL for prod)
        self.client = QdrantClient(location=":memory:")
        
        # Load Models
        # Bi-encoder for fast retrieval
        self.encoder = SentenceTransformer(EMBEDDING_MODEL_ID)
        # Cross-encoder for high-precision re-ranking
        self.reranker = CrossEncoder(RERANKER_MODEL_ID)
        
        self._init_collection()

    def _init_collection(self):
        """
        Initialize collection with Dot Product distance.
        This requires normalized vectors but offers the lowest latency.
        """
        self.client.recreate_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=models.VectorParams(
                size=VECTOR_SIZE,
                distance=models.Distance.DOT
            )
        )

    def _normalize_l2(self, vectors: np.ndarray) -> np.ndarray:
        """
        Apply L2 Normalization to vectors.
        Formula: v / ||v||
        """
        norm = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Guard against division by zero for degenerate (all-zero) vectors
        norm = np.where(norm == 0, 1.0, norm)
        return vectors / norm

    def ingest_documents(self, docs: List[DocumentChunk]):
        """
        Embeds, normalizes, and upserts documents.
        """
        texts = [d.content for d in docs]
        
        # 1. Generate Embeddings
        embeddings = self.encoder.encode(texts, convert_to_numpy=True)
        
        # 2. L2 Normalize (Critical for Dot Product optimization)
        normalized_embeddings = self._normalize_l2(embeddings)
        
        # 3. Prepare Payload
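        # NOTE: sequential ids are fine for this single-batch demo; in production,
        # use stable UUIDs so ids do not collide across ingestion batches.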
        points = [
            models.PointStruct(
                id=idx,
                vector=vector.tolist(),
                payload={**doc.metadata, "content": doc.content}
            )
            for idx, (vector, doc) in enumerate(zip(normalized_embeddings, docs))
        ]
        
        self.client.upsert(
            collection_name=COLLECTION_NAME,
            points=points
        )
        print(f"✅ Ingested {len(docs)} documents.")

    def search(self, query: str, category_filter: Optional[str] = None, limit: int = 5) -> List[SearchResult]:
        """
        Hybrid retrieval pipeline: 
        Vector Search (Recall) -> Cross-Encoder Re-ranking (Precision)
        """
        # 1. Embed and Normalize Query
        query_vec = self.encoder.encode([query], convert_to_numpy=True)
        query_vec = self._normalize_l2(query_vec)[0]

        # 2. Define Metadata Filters (Pre-filtering reduces search space)
        query_filter = None
        if category_filter:
            query_filter = models.Filter(
                must=[
                    models.FieldCondition(
                        key="category",
                        match=models.MatchValue(value=category_filter)
                    )
                ]
            )

        # 3. First Pass: Retrieve Top K*3 (High Recall)
        # We fetch more than we need to allow the reranker to filter out noise
        initial_hits = self.client.search(
            collection_name=COLLECTION_NAME,
            query_vector=query_vec.tolist(),
            query_filter=query_filter,
            limit=limit * 3 
        )

        if not initial_hits:
            return []

        # 4. Second Pass: Cross-Encoder Re-ranking
        # Pair the query with every retrieved document content
        cross_inp = [[query, hit.payload['content']] for hit in initial_hits]
        cross_scores = self.reranker.predict(cross_inp)

        # Zip results with new scores and sort
        reranked_results = []
        for idx, hit in enumerate(initial_hits):
            reranked_results.append(SearchResult(
                content=hit.payload['content'],
                metadata={k: v for k, v in hit.payload.items() if k != "content"},
                score=float(cross_scores[idx]) # New precision score
            ))

        # Sort by cross-encoder score (descending)
        reranked_results.sort(key=lambda x: x.score, reverse=True)

        # 5. Threshold Filtering (The Anti-Hallucination Guardrail)
        # If the best match has a low score, return nothing rather than noise.
        # Score thresholds depend on the specific CrossEncoder model used.
        # For ms-marco-MiniLM, scores are logits; > 0 is usually relevant.
        final_results = [res for res in reranked_results if res.score > 0.0]
        
        return final_results[:limit]

# --- Usage Example ---

if __name__ == "__main__":
    rag = RAGService()

    # Seed Data
    data = [
        DocumentChunk(content="Deployment logs show a 500 error in the payment gateway.", metadata={"category": "logs", "date": "2023-10-01"}),
        DocumentChunk(content="The payment API requires an API key in the header.", metadata={"category": "docs", "date": "2023-01-01"}),
        DocumentChunk(content="The office kitchen is closed for cleaning.", metadata={"category": "general", "date": "2023-10-01"}),
        DocumentChunk(content="Latency spikes observed in the eu-west-1 region.", metadata={"category": "logs", "date": "2023-10-02"}),
    ]
    
    rag.ingest_documents(data)

    print("\n--- Query: 'Payment failures' (Filtered by category: logs) ---")
    results = rag.search("Why are payments failing?", category_filter="logs", limit=2)
    
    for r in results:
        print(f"[{r.score:.4f}] {r.content}")

    print("\n--- Query: 'Office Snacks' (Irrelevant to Tech Stack) ---")
    # This should return an empty list or very low scores due to re-ranking threshold
    results = rag.search("What snacks are available?", category_filter="logs")
    if not results:
        print("✅ No relevant results found (Hallucination prevented).")
    else:
        for r in results:
            print(f"[{r.score:.4f}] {r.content}")

Why This Works

1. L2 Normalization & Dot Product

By setting distance=models.Distance.DOT and normalizing every vector via _normalize_l2 before ingestion, the engine can rank by a raw dot product and never has to compute vector magnitudes at query time.

  • Euclidean/L2: Good, but sensitive to vector magnitude.
  • Cosine: Standard, but requires magnitude calculation during query time.
  • Dot Product (Normalized): Mathematically equivalent to Cosine Similarity (written out below), but faster because the division by magnitudes is pre-computed at ingestion time. On large datasets this can reduce query latency by roughly 20-30%, though the exact gain depends on the engine and index configuration.
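
Written out, the relationship is simple:

cos(a, b) = (a · b) / (||a|| · ||b||)

If every stored vector and every query vector is pre-normalized so that ||a|| = ||b|| = 1, the denominator is 1 and the score collapses to a · b: the same ranking, with the norm computations paid once at ingestion instead of on every comparison.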

2. The Re-ranking Step

The standard Bi-Encoder (used for the vector search) compresses a paragraph into a single vector. This compression is "lossy." The Cross-Encoder takes the (Query, Document) pair and outputs a similarity score by processing them together through the transformer network.

  • Bi-Encoder: Fast, used for the first-pass retrieval (the limit * 3 shortlist).
  • Cross-Encoder: Slow but highly accurate, used to re-rank that shortlist down to the final Top-K. By chaining them, you get the speed of vector search with the accuracy of BERT-style attention mechanisms; a standalone illustration of what the re-ranker sees and returns follows.
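
As a standalone illustration (independent of the RAGService class above), this is all the re-ranking step amounts to. Scores are raw logits from the ms-marco model, so higher means more relevant and values below zero are typically irrelevant:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [
    ["Why are payments failing?", "Deployment logs show a 500 error in the payment gateway."],
    ["Why are payments failing?", "The office kitchen is closed for cleaning."],
]

# predict() returns one relevance score (logit) per (query, document) pair
scores = reranker.predict(pairs)
print(scores)  # expect a clearly positive score for the first pair, a negative one for the second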

3. Threshold Filtering

Notice the filter: res.score > 0.0. Standard vector search always returns something, even when the closest match is far away. By thresholding the Cross-Encoder score (a raw logit here; you can equivalently apply a sigmoid and threshold the resulting 0-1 probability), we explicitly tell the application: "If the relevance is below X, assume we don't know the answer." This is one of the most effective ways to stop an LLM from hallucinating. If the retrieval layer returns nothing, the LLM can safely respond with "I don't have enough information" rather than fabricating an answer based on "The office kitchen is closed." A minimal sketch of that downstream guard follows.
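
Downstream, the application layer can turn an empty retrieval result into an explicit refusal before the LLM is ever called. A minimal sketch, where generate_answer is a hypothetical name and call_llm stands in for whatever LLM client you use:

def generate_answer(rag: RAGService, question: str, call_llm) -> str:
    """call_llm is any callable that takes a prompt string and returns the model's reply."""
    results = rag.search(question, limit=3)
    if not results:
        # Nothing cleared the re-ranker threshold: refuse instead of guessing
        return "I don't have enough information to answer that."

    context = "\n\n".join(r.content for r in results)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)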

Conclusion

Optimizing RAG for production isn't about finding a better LLM; it's about cleaning the pipeline that feeds it. By enforcing L2 normalization, you stabilize your latency. By implementing a Re-ranking/Cross-Encoder stage with strict thresholds, you ensure that your LLM only receives high-quality context, drastically reducing the rate of hallucinations.
