PostgreSQL pgvector Optimization: Tuning HNSW vs IVFFlat for Billion-Scale Embeddings

The "Day Two" Scaling Trap

Building a vector search prototype with PostgreSQL and pgvector is deceptively simple. You spin up a Docker container, ingest 10,000 document chunks, run a standard cosine-similarity query, and get results in 15 milliseconds.

Then "Day Two" arrives. You move to production with 50 million vectors (e.g., OpenAI’s text-embedding-3-small at 1536 dimensions). Suddenly, your index build takes 14 hours, your RAM usage spikes to 100%, causing OOM kills, and your simple SELECT query takes 2 seconds to return.

The default configuration of pgvector is not tuned for high-throughput or large datasets. Scaling requires a deliberate choice between IVFFlat (Inverted File Flat) and HNSW (Hierarchical Navigable Small World) indexes, and more importantly, tuning the critical build-time and query-time parameters that govern the precision-performance trade-off.

Root Cause: The Geometry of Indexing

The performance degradation stems from the fundamental difference in how these indexes map high-dimensional space and how PostgreSQL manages memory during these operations.

1. The IVFFlat Bottleneck (The Cluster Problem)

IVFFlat works by using K-Means clustering to divide the vector space into "lists" (centroids).

  • The Issue: It requires a training step. If your data distribution changes (drift) after the index is built, the centroids become suboptimal, tanking recall.
  • The Search Cost: To find a neighbor, the database identifies the closest centroids and scans every vector in those clusters. With 100M vectors and too few lists (say, 100), each probe scans roughly a million vectors, which is essentially a sequential scan over a massive subset of the data.

2. The HNSW Trade-off (The Memory Problem)

HNSW builds a multi-layered graph. The bottom layer contains all data points; upper layers act as express highways to navigate to the correct neighborhood.

  • The Issue: HNSW is memory-greedy. It stores the graph structure (links between nodes) alongside the data.
  • The Scan: It performs a greedy traversal. If the graph doesn't fit in the Postgres shared_buffers (or OS page cache), you trigger massive random disk I/O (IOPS), which is the primary killer of vector search latency.
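
A quick way to check whether the graph can stay cache-resident is to compare the index size (using the index name created later in this article) against your buffer settings:

-- Index size vs. configured buffer memory
SELECT pg_size_pretty(pg_relation_size('idx_embeddings_hnsw')) AS index_size;
SHOW shared_buffers;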

The Fix: Optimized Configuration and Indexing Strategies

Below is a rigorous setup for a high-scale vector workload (50M+ rows). We will optimize the Postgres environment first, then implement a tuned HNSW index, as it offers the best recall/performance ratio for production workloads, provided you tune the memory correctly.

1. Environmental Tuning (Crucial Prerequisites)

Before creating the index, you must tune Postgres to allow parallel workers to build the graph efficiently. Run these as a superuser via ALTER SYSTEM, or set the equivalent values in postgresql.conf.

-- Maximize parallel workers for index creation
-- (parallel HNSW builds require pgvector 0.6.0+; the effective count is
-- also capped by max_parallel_workers and max_worker_processes).
-- If you have 16 vCPUs, reserve some for the OS and other connections.
ALTER SYSTEM SET max_parallel_maintenance_workers = 12;

-- Increase memory for maintenance operations (index creation).
-- This does NOT affect standard query RAM.
-- If the HNSW graph under construction outgrows this limit, the build
-- slows dramatically (pgvector emits a notice when that happens), so
-- set it as high as your available RAM safely allows.
ALTER SYSTEM SET maintenance_work_mem = '4GB';

-- Reload config without restart
SELECT pg_reload_conf();
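
ALTER SYSTEM writes these values to postgresql.auto.conf; confirm they are active in a fresh session before kicking off the build:

-- Verify the settings took effect
SHOW max_parallel_maintenance_workers;
SHOW maintenance_work_mem;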

2. Schema Definition

If some precision loss is acceptable (it is usually negligible for RAG), use halfvec (available in pgvector 0.7.0+) to cut vector storage by 50%; otherwise, stick with the standard vector type for full precision. The schema below uses full-precision vector.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    -- 1536 dimensions (OpenAI text-embedding-3-small)
    -- Swap to halfvec(1536) to cut storage by ~50% (pgvector 0.7.0+)
    embedding vector(1536)
);

-- Turn off autovacuum temporarily during massive bulk ingestion
-- to prevent resource contention, then turn back on.
ALTER TABLE embeddings SET (autovacuum_enabled = false);
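
For the initial load itself, COPY is far faster than row-by-row INSERTs. A minimal sketch, assuming a hypothetical CSV file where the embedding column uses pgvector's textual '[x,y,...]' format:

-- Bulk ingest (file path is illustrative)
COPY embeddings (content, embedding)
FROM '/path/to/embeddings.csv'
WITH (FORMAT csv);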

3. The HNSW Strategy (Recommended for Performance)

HNSW is generally superior to IVFFlat for query speed and recall, but it requires tuning m and ef_construction.

The Math:

  • m (Max connections per layer): Controls memory consumption per row.
    • Default: 16.
    • Optimization: Range 16-64. Higher m = better recall, much higher memory usage, slower build. Keep at 16 unless recall is poor, then step to 24.
  • ef_construction (Candidate list size during build): Controls index quality.
    • Default: 64.
    • Optimization: Range 64-512. Higher = longer build time, but significantly better index quality (recall). This does not impact search speed, only build time.

-- Re-enable autovacuum before indexing
ALTER TABLE embeddings SET (autovacuum_enabled = true);

-- Create the HNSW index
-- Pick the opclass for your distance metric: vector_cosine_ops, vector_l2_ops, or vector_ip_ops
CREATE INDEX CONCURRENTLY idx_embeddings_hnsw 
ON embeddings 
USING hnsw (embedding vector_cosine_ops) 
WITH (
    m = 16, 
    ef_construction = 128
);
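
Builds at this scale can run for hours, so monitor progress from a second session via pg_stat_progress_create_index (PostgreSQL 12+):

-- Watch the index build from another session
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;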

4. The IVFFlat Strategy (Legacy/Low-Memory Constraint)

Use this only if your RAM is severely constrained and you cannot fit the HNSW graph in memory. The golden rule for IVFFlat is sizing the lists parameter correctly.

The Math:

  • For < 1M rows: lists = rows / 1000
  • For > 1M rows: lists = sqrt(rows)

Example for 50 Million vectors: sqrt(50,000,000) ≈ 7071.
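
Rather than hard-coding the value, you can derive it from the live row count (a minimal sketch of the heuristic above):

-- Suggest a lists value from the current table size
SELECT CASE WHEN count(*) < 1000000
            THEN greatest(count(*) / 1000, 1)
            ELSE trunc(sqrt(count(*)))::bigint
       END AS suggested_lists
FROM embeddings;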

-- Ensure you have data loaded BEFORE creating IVFFlat
-- IVFFlat needs data to calculate centroids (K-Means)

CREATE INDEX CONCURRENTLY idx_embeddings_ivfflat
ON embeddings 
USING ivfflat (embedding vector_cosine_ops) 
WITH (
    lists = 7000  -- ≈ sqrt(50M), rounded down
);

5. Query-Time Optimization

Creating the index is only half the battle. You must also tell pgvector how aggressively to search it at query time.

-- FOR HNSW:
-- ef_search: Size of the dynamic list for the nearest neighbors during search.
-- Default: 40.
-- Trade-off: Higher = Better Recall, Slower Query.
-- Rule of Thumb: Start low, increase until recall plateaus.

SET hnsw.ef_search = 100;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, 1 - (embedding <=> '[...query vector...]'::vector) as similarity
FROM embeddings
ORDER BY embedding <=> '[...query vector...]'::vector
LIMIT 10;
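
Because hnsw.ef_search is a session-level setting, use SET LOCAL when you only want it raised for a single query:

-- Scope the setting to one transaction; other queries keep the default
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, content
FROM embeddings
ORDER BY embedding <=> '[...query vector...]'::vector
LIMIT 10;
COMMIT;
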
-- FOR IVFFlat:
-- probes: How many lists (centroids) to check.
-- Default: 1. (Terrible recall)
-- Rule of Thumb: sqrt(lists). For 7000 lists, probe ~80.

SET ivfflat.probes = 80;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM embeddings
ORDER BY embedding <=> '[...query vector...]'::vector
LIMIT 10;
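
To quantify recall for either index type, compare the approximate results against exact ground truth; disabling index scans for one transaction forces an exact, brute-force ordering:

-- Exact nearest neighbors for recall measurement
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id
FROM embeddings
ORDER BY embedding <=> '[...query vector...]'::vector
LIMIT 10;
COMMIT;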

Technical Explanation: Why These Numbers Matter

The HNSW Memory/Recall Curve

The parameter m defines the maximum number of bi-directional links created for every element in the graph.

  • Why m=16? Per-row memory is roughly the vector payload plus link storage: at 1536 dimensions the vector alone is ~6 KB (1536 × 4 bytes), plus on the order of 2·m neighbor pointers on the base layer, so the vector data dominates. Increasing m to 64 yields diminishing returns on recall but bloats the index, pushing it out of the cache (see the back-of-envelope sketch after this list).
  • Why ef_construction=128? This parameter widens the candidate list the algorithm evaluates when inserting each element, so it finds better links. Doubling it roughly doubles build time, but the resulting graph is much easier to traverse during the SELECT phase, effectively "pre-paying" the compute cost.
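
A back-of-envelope size estimate, assuming roughly 2·m 8-byte neighbor pointers per row (the real on-disk layout differs, but the order of magnitude holds):

-- Rough HNSW footprint: 50M rows, 1536 float4 dims, m = 16
SELECT pg_size_pretty(
    50000000::bigint * (1536 * 4 + 2 * 16 * 8)
) AS approx_index_size;  -- ≈ 300 GB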

The Buffer Cache Limit

PostgreSQL relies heavily on the OS page cache. HNSW is fast because it is a graph traversal: if the graph's pages are in RAM, each hop from node to node costs nanoseconds; if a page must be fetched from disk, the same hop costs orders of magnitude more.

  • The Cliff: If your index size (e.g., 50GB) exceeds your available RAM (e.g., 32GB), performance does not degrade linearly; it falls off a cliff, because disk I/O wait times come to dominate query execution.
  • Solution: If you hit this wall, switch to halfvec (2-byte half-precision floats) to halve the storage requirement, allowing twice the vectors to fit in the same RAM footprint.
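
pgvector also supports indexing a full-precision column through a half-precision cast, which leaves the stored data intact while halving the index footprint (assuming pgvector 0.7.0+ and the table from this article):

-- Half-precision HNSW index over the existing full-precision column
CREATE INDEX CONCURRENTLY idx_embeddings_hnsw_half
ON embeddings
USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops)
WITH (m = 16, ef_construction = 128);

For the planner to use this index, queries must order by the same cast expression, e.g. ORDER BY embedding::halfvec(1536) <=> '[...query vector...]'::halfvec(1536).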

Conclusion

For billion-scale (or even 50M+ scale) vector workloads in PostgreSQL:

  1. Prefer HNSW over IVFFlat unless you are strictly RAM-bound.
  2. Tune Build Parameters: Set ef_construction to at least 128 to ensure index quality. Keep m conservative (16-24) to manage memory size.
  3. Tune Query Parameters: Set hnsw.ef_search dynamically based on your latency SLA (typically 40-100).
  4. Hardware Reality: Ensure RAM > Index Size. If you cannot afford the RAM, quantization (halfvec) is a better optimization path than switching to the inferior IVFFlat algorithm.