In the lifecycle of a Retrieval-Augmented Generation (RAG) application, there is a specific breaking point. It usually arrives when your vector data grows from a clean 50k-row proof-of-concept into a messy, real-world dataset of 1 to 10 million rows.
Suddenly, sub-millisecond queries spike to 300ms+, or worse, your LLM starts hallucinating because the vector database failed to retrieve the most semantically relevant chunks.
The culprit is almost always a misunderstanding of the Hierarchical Navigable Small World (HNSW) index parameters in pgvector. Developers often keep the default index settings, unaware that HNSW is an approximate algorithm in which latency and recall pull directly against each other.
The Root Cause: How HNSW Breaks Down
Unlike IVFFlat (Inverted File Flat), which partitions space into clusters, HNSW builds a multi-layered graph. Think of it as a skip-list for vectors.
- Upper Layers: sparse graphs with long "highways" that allow the algorithm to jump across the vector space quickly.
- Lower Layers: dense graphs for fine-grained traversal to find local neighbors.
When you run a query, pgvector enters the graph at a high layer and greedily traverses down to the nearest neighbors.
The performance issues at scale stem from two specific mechanical failures:
- The "Stranded" Node (Low Recall): If the graph doesn't have enough connections (edges) per node, the traversal might get stuck in a "local minimum." It thinks it found the best match, but the actual best match is in a different cluster, and there was no edge connecting them.
- The Over-Scanner (High Latency): To compensate for poor connectivity, you force the query to keep a larger "candidate list" in memory during traversal. This increases accuracy but forces the CPU to calculate distance metrics for thousands of vectors per query, killing throughput.
To fix this, we must tune the index build parameters (m, ef_construction) and the runtime parameter (hnsw.ef_search) independently.
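Before changing anything, confirm what your current index was actually built with. Here is a quick check, assuming the documents table and documents_embedding_idx index names used throughout this article; if the index was created with defaults, no WITH clause appears and reloptions is NULL:

-- Full index definition, including any explicit WITH (m = ..., ef_construction = ...) options
SELECT indexdef
FROM pg_indexes
WHERE indexname = 'documents_embedding_idx';

-- Storage parameters only; NULL means the defaults (m = 16, ef_construction = 64) are in effect
SELECT reloptions
FROM pg_class
WHERE relname = 'documents_embedding_idx';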
The Solution: A Data-Driven Tuning Strategy
Do not guess parameters. You must measure Recall (the percentage of true top-K nearest neighbors found) against Queries Per Second (QPS).
Step 1: Establish a Recall Baseline
Before tuning, you need a script to measure how accurate your current index is compared to a "perfect" execution. We do this by forcing a brute-force KNN search (exact match) and comparing it to the HNSW (approximate) result.
Run this SQL block to calculate your recall for a random sample of your data.
WITH random_probes AS (
    -- Select 10 random vectors to test against
    SELECT embedding
    FROM documents
    ORDER BY random()
    LIMIT 10
),
exact_results AS (
    -- Brute-force calculation (ground truth).
    -- Wrapping the distance in "(...) + 0" keeps the planner from matching
    -- the ORDER BY to the HNSW index, so this branch runs an exact scan.
    SELECT
        p.embedding AS probe_vec,
        d.id
    FROM random_probes p
    CROSS JOIN LATERAL (
        SELECT id
        FROM documents
        ORDER BY (embedding <=> p.embedding) + 0
        LIMIT 10
    ) d
),
approx_results AS (
    -- HNSW calculation (the index we are testing)
    SELECT
        p.embedding AS probe_vec,
        d.id
    FROM random_probes p
    CROSS JOIN LATERAL (
        SELECT id
        FROM documents
        ORDER BY embedding <=> p.embedding
        LIMIT 10
    ) d
),
stats AS (
    SELECT
        count(*) AS total_matches,
        (SELECT count(*) FROM exact_results) AS total_expected
    FROM exact_results e
    JOIN approx_results a
      ON e.id = a.id AND e.probe_vec = a.probe_vec
)
SELECT
    total_matches::float / total_expected::float AS recall_score
FROM stats;
If your recall_score comes back below 0.95 (95%), your RAG application is failing to surface relevant context in more than 5% of retrievals.
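Recall is only half of the picture; the other half is latency, and before tuning either you should confirm that queries are actually hitting the HNSW index. Here is a minimal check, assuming the same documents table and reusing an arbitrary stored vector as the probe:

-- Confirm the planner uses the HNSW index and measure real query latency
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents LIMIT 1)
LIMIT 10;

The plan should report an Index Scan using documents_embedding_idx; if you see a Seq Scan instead, the index is not being used and no amount of parameter tuning will help. The Execution Time line is the per-query latency to weigh against your recall score.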
Step 2: Optimizing Index Construction
The CREATE INDEX parameters define the shape of the graph. These are immutable once the index is built.
m (Max Connections): The number of bi-directional links created for every new element.
- Default: 16
- Scale Recommendation: For 1M+ rows, increase to 32 or 64. This increases memory usage and build time significantly but creates a "denser" highway system, reducing the chance of getting stranded in a local minimum.

ef_construction (Candidate List Size during Build): The size of the dynamic list used to find candidate neighbors while building the graph.
- Default: 64
- Scale Recommendation: Set this to 2x to 4x your m value. A higher value here results in a higher quality graph (better connections) but does not affect query latency, only build time.
The Optimized Migration:
-- Drop the old index
-- (warning: queries fall back to exact sequential scans until the new index is built)
DROP INDEX IF EXISTS documents_embedding_idx;

-- Recreate with tuned parameters for a 1M+ row dataset
-- Using cosine distance (vector_cosine_ops / <=>), the usual choice for OpenAI/SBERT embeddings
CREATE INDEX documents_embedding_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (
    m = 32,                -- Higher connectivity
    ef_construction = 128  -- Deeper candidate search during the graph build
);
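Rebuilding an HNSW index over millions of rows can take a long time. Before running the CREATE INDEX above, two session settings are worth raising; the values below are illustrative, so size them to your hardware, and note that parallel HNSW builds require a reasonably recent pgvector release:

-- Let the graph be built in memory; builds slow down sharply if it spills past this limit
SET maintenance_work_mem = '4GB';

-- Build the graph with parallel workers (plus the leader process)
SET max_parallel_maintenance_workers = 7;

-- After the build, sanity-check the on-disk size of the new index
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));

If the table must keep serving traffic during the rebuild, consider creating the replacement index under a temporary name with CREATE INDEX CONCURRENTLY and swapping the names afterwards, rather than dropping the old index first.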
Step 3: Tuning Runtime Latency (hnsw.ef_search)
This is the most critical knob for RAG developers. ef_search determines how many candidates the database tracks during a select query.
- Low ef_search (e.g., 40): Blazing fast, lower recall.
- High ef_search (e.g., 200): Slower, with recall approaching that of an exact scan.
Crucially, you do not need to rebuild the index to change this. You can set it per transaction or per connection. This allows you to dynamically adjust based on load.
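In plain SQL, the two scopes look like this; SET persists for the rest of the connection, while SET LOCAL reverts as soon as the enclosing transaction commits or rolls back:

-- Connection scope: applies to every subsequent query on this session
SET hnsw.ef_search = 100;

-- Transaction scope: reverts automatically at COMMIT or ROLLBACK
BEGIN;
SET LOCAL hnsw.ef_search = 200;
-- ... run the vector search here ...
COMMIT;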
In a Node.js/TypeScript backend (using pg or an ORM like Prisma/Drizzle), set this before your query:
import { Pool } from 'pg';

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
});

async function searchVectors(queryEmbedding: number[]) {
  const client = await pool.connect();
  try {
    // SET LOCAL only takes effect inside an explicit transaction
    await client.query('BEGIN');

    // Tweak recall vs. latency for this specific transaction.
    // 40 is the default; 100 is a good starting point for high accuracy.
    await client.query('SET LOCAL hnsw.ef_search = 100');

    // <=> returns cosine distance, so 1 - distance is cosine similarity.
    // The embedding is sent as a '[...]' string and cast to the vector type.
    const result = await client.query(
      `SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
       FROM documents
       ORDER BY embedding <=> $1::vector
       LIMIT 5`,
      [JSON.stringify(queryEmbedding)]
    );

    await client.query('COMMIT');
    return result.rows;
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
}
Why This Works
- Increasing m to 32: We doubled the number of edges per node. In high-dimensional space (like OpenAI's 1536 dimensions), nodes are sparse. More edges give the traversal algorithm more pathways to "escape" a local cluster and find the true nearest neighbor.
- Increasing ef_construction to 128: During the build phase, the algorithm considers 128 candidate neighbors before deciding which 32 edges (m) to keep. This ensures that the edges we do keep are the highest-quality connections, creating a "small world" graph that is easier to navigate.
- Dynamic ef_search: By setting SET LOCAL hnsw.ef_search, we decouple the index structure from query execution. If your app is under heavy load, you can programmatically lower ef_search to 40 to save CPU cycles. If accuracy is paramount (e.g., a legal or medical RAG bot), you can bump it to 200 (see the comparison below).
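To see the dial in action, run the same query at both ends of the range and compare the reported execution times (a sketch reusing an arbitrary stored vector as the probe); pair each setting with a re-run of the Step 1 recall script to find the smallest ef_search that still clears your recall target:

-- Speed end of the dial
BEGIN;
SET LOCAL hnsw.ef_search = 40;
EXPLAIN ANALYZE
SELECT id FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents LIMIT 1)
LIMIT 10;
COMMIT;

-- Accuracy end of the dial
BEGIN;
SET LOCAL hnsw.ef_search = 200;
EXPLAIN ANALYZE
SELECT id FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents LIMIT 1)
LIMIT 10;
COMMIT;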
Conclusion
Default pgvector settings favor write speed and small datasets. For production RAG applications at scale, you must intervene.
Prioritize Recall during the index build (high m, high ef_construction) to ensure the graph is navigable. Then, tune Latency at runtime using hnsw.ef_search. This approach ensures your LLM gets the right context without waiting 500ms for the database to respond.