PostgreSQL Index Bloat: When VACUUM Isn't Enough and REINDEX CONCURRENTLY Saves Production

You have a high-throughput PostgreSQL database. You monitor your disk usage and notice your indexes are growing disproportionately to your table data. Query performance is degrading; specifically, index scans are becoming I/O-bound. You check pg_stat_user_tables and confirm that autovacuum is running frequently.
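
A quick sanity check against pg_stat_user_tables (the users table name here is illustrative):

-- Is autovacuum actually visiting the hot table?
SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';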

The natural assumption is that autovacuum handles space reclamation. However, in write-heavy environments—especially those with random-access updates or deletes—autovacuum often fails to shrink the physical footprint of an index. The index becomes "bloated," containing massive amounts of empty space that the OS filesystem still has to cache or read.

Historically, reclaiming this space required REINDEX, which locks the table against writes—a non-starter for 24/7 production systems. The solution introduced in PostgreSQL 12 (and refined since) is REINDEX CONCURRENTLY.

The Root Cause: MVCC and B-Tree Fragmentation

To fix the problem, you must understand why autovacuum isn't enough.

PostgreSQL uses Multi-Version Concurrency Control (MVCC). When you UPDATE a row, Postgres doesn't overwrite the data in place. It marks the old tuple (row version) as dead and inserts a new tuple.
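
You can watch this happen in the statistics views. A minimal sketch, assuming a users table with id and last_login columns (the stats update with a small lag):

-- Every UPDATE leaves a dead row version behind
UPDATE users SET last_login = now() WHERE id = 42;

-- n_dead_tup climbs until (auto)vacuum processes the table
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';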

1. The Limits of VACUUM

VACUUM marks dead tuples as reusable. In a heap (table data), this works reasonably well. However, B-Tree indexes operate differently.

  • Page Splits: B-Trees are organized into pages (usually 8KB). When a page is full and a new key is added, the page splits.
  • Reusable vs. Reclaimable: VACUUM will remove pointers to dead tuples inside an index page, marking that space as "available for new keys." However, it cannot easily merge two half-empty pages back together, and it can only truncate the file when the empty pages sit at its physical end (the demonstration below makes this concrete).
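
A minimal demonstration of that limitation, using a throwaway table (sizes are approximate and vary by version and alignment):

CREATE TABLE bloat_demo (id int PRIMARY KEY);
INSERT INTO bloat_demo SELECT generate_series(1, 1000000);
SELECT pg_size_pretty(pg_relation_size('bloat_demo_pkey'));  -- roughly 21 MB

DELETE FROM bloat_demo WHERE id % 2 = 0;  -- hollow out half of every leaf page
VACUUM bloat_demo;
SELECT pg_size_pretty(pg_relation_size('bloat_demo_pkey'));  -- still roughly 21 MB

Half the entries are gone, but because no leaf page became entirely empty, the index file cannot shrink.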

2. The Fragmentation Spiral

If your application performs random updates (e.g., updating a last_login timestamp or a status column), your index becomes a Swiss cheese of empty pockets.

  • If a page is 90% empty, it still takes up 8KB on disk.
  • Your buffer cache fills up with mostly empty index pages (you can measure this with pg_buffercache; see the sketch after this list).
  • Your IOPS skyrocket because you are reading 100GB of data to satisfy a query that should only touch 10GB.
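
To quantify the cache pollution, the pg_buffercache extension (shipped in contrib) shows how much of shared_buffers a single index occupies. A sketch, assuming the default 8KB block size and an illustrative index name:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
WHERE c.relname = 'users_email_idx'
GROUP BY c.relname;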

The Fix: Identification and Concurrent Reindexing

This solution assumes PostgreSQL 12 or higher.

Step 1: Identify Bloated Indexes Accurately

Don't guess. Use the pgstattuple extension for precise measurements on specific suspects, or use a heuristic query for a cluster-wide scan.

For a rigorous check on a specific index, enable the extension and check density:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- deeply inspect a specific index (can be I/O intensive, run with care on massive tables)
SELECT * FROM pgstatindex('public.users_email_idx');

Output Analysis: Look at avg_leaf_density. A freshly built B-tree sits near its fillfactor (90 by default), so a value below 60-70 on a static or insert-only table, or below 50 on a heavily updated table, indicates significant bloat.
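
If you only want the numbers that matter for triage:

SELECT avg_leaf_density, leaf_fragmentation,
       pg_size_pretty(index_size) AS index_size
FROM pgstatindex('public.users_email_idx');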

For a fast, database-wide estimation (without reading the actual index data), you can use a rough heuristic based on system statistics:

-- Rough, statistics-only estimate: compares each index's actual size with an
-- ideal size derived from reltuples, assuming ~32 bytes per B-tree entry
-- (line pointer + tuple header + a small key). Tune that constant for wide
-- keys, and run ANALYZE first so reltuples and relpages are fresh.
SELECT
    current_database() AS database,
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    /* current size of the index */
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    /* estimated wasted space */
    pg_size_pretty((pg_relation_size(indexrelid) * bloat_ratio / 100)::bigint) AS bloat_size,
    /* estimated percentage of bloat */
    round(bloat_ratio::numeric, 1) AS bloat_ratio
FROM (
    SELECT
        s.schemaname, s.relname, s.indexrelname, s.indexrelid,
        CASE WHEN c.relpages = 0 OR c.reltuples < 0
                  OR pg_relation_size(s.indexrelid) < 1000000 THEN 0
             ELSE greatest(0, 100 * (1 - (c.reltuples * 32 / 8192) / c.relpages))
        END AS bloat_ratio
    FROM pg_stat_user_indexes s
    JOIN pg_class c ON c.oid = s.indexrelid
) AS bloat_data
WHERE bloat_ratio > 30 -- filter for indexes estimated > 30% bloated
ORDER BY pg_relation_size(indexrelid) DESC;

Treat the output as a shortlist of suspects to verify with pgstatindex, not as ground truth.

Step 2: Reindex Concurrently

Once you have identified that users_email_idx is 4GB but contains only 500MB of live data, execute the concurrent reindex.

-- This command will NOT lock the table for writes.
-- It may take significantly longer than a standard REINDEX.
REINDEX INDEX CONCURRENTLY public.users_email_idx;
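
Long rebuilds can be watched from a second session via pg_stat_progress_create_index (available since PostgreSQL 12; it reports REINDEX CONCURRENTLY as well):

SELECT pid, phase, blocks_done, blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;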

If you need to reindex an entire table's indexes:

REINDEX TABLE CONCURRENTLY public.users;
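
Afterward, confirm the space actually came back:

SELECT pg_size_pretty(pg_relation_size('public.users_email_idx'));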

Step 3: Handling Failures (Invalid Indexes)

REINDEX CONCURRENTLY runs as a sequence of transactions, so it cannot simply roll back the way a single statement would. If the command is interrupted (e.g., query timeout, DB restart, deadlock), it may leave behind an "invalid" index with a suffix like _ccnew.

You must clean these up, or they will keep consuming disk space and write overhead (they are still maintained on every INSERT and UPDATE) without ever being used for queries.

Check for invalid indexes:

SELECT n.nspname as schema_name,
       c.relname as index_name,
       pg_size_pretty(pg_relation_size(c.oid)) as size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid; -- invalid indexes are never used by the planner

Fixing an invalid index: the recovery depends on the suffix. A leftover ending in _ccnew is the transient index from the failed attempt; drop it, then run the REINDEX again. A leftover ending in _ccold is the original index that could not be dropped after an otherwise successful rebuild; dropping it is all that remains to do.

-- Typical recovery after a failed attempt (the _ccnew name is illustrative)
DROP INDEX CONCURRENTLY IF EXISTS public.users_email_idx_ccnew;
REINDEX INDEX CONCURRENTLY public.users_email_idx;

The Explanation: How It Works Without Locks

Standard REINDEX takes an ACCESS EXCLUSIVE lock on the index being rebuilt (blocking any query that wants to use it) plus a lock on the table that blocks all writes. REINDEX CONCURRENTLY avoids this through a multi-step process that only requires SHARE UPDATE EXCLUSIVE locks, which conflict with schema changes like ALTER TABLE but not with SELECT, INSERT, UPDATE, or DELETE.

Here is the internal lifecycle of the command:

  1. Create Shadow Index: Postgres creates a new transient index (usually suffixed _ccnew) alongside the existing one; it is visible in the catalogs while the rebuild runs, as the query after this list shows.
  2. Initial Build: It populates the new index with data currently in the table. While this happens, writes (INSERT/UPDATE/DELETE) are forwarded to both the old index and the new shadow index.
  3. Catch-up Phase: Postgres waits for any transactions that started before the index build began to finish (to ensure MVCC consistency).
  4. Swap: Once the new index is up to date, Postgres swaps the metadata. The new index becomes the "live" index, and the old index is marked as "dead".
  5. Drop: The old index file is deleted, reclaiming the disk space.
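
You can observe the shadow index from another session while a rebuild is in flight:

-- During the rebuild, expect two rows: the original and a _ccnew sibling
SELECT c.relname, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname LIKE 'users_email_idx%';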

Resource Considerations

While this saves you from downtime, it is not free:

  • I/O Load: You are reading the entire table and writing a new index. On I/O-bound systems, this can degrade query performance.
  • Disk Space: During the operation, you need enough free disk space to store both the old index and the new index simultaneously.
  • Deadlock Risk: While rare, concurrent reindexing can deadlock with other maintenance operations; the defensive session settings sketched after this list help contain this.
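
A defensive session setup before kicking off a rebuild (the timeout values are examples, not recommendations):

SET lock_timeout = '5s';     -- bail out of lock waits instead of queueing behind them
SET statement_timeout = 0;   -- don't let a session default cancel a long rebuild
REINDEX INDEX CONCURRENTLY public.users_email_idx;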

Conclusion

Index bloat is an inevitable consequence of high-write workloads in PostgreSQL. autovacuum keeps the table healthy but often leaves indexes fragmented. By monitoring pg_stat_user_indexes and pgstattuple, and using REINDEX INDEX CONCURRENTLY, you can reclaim gigabytes of disk space and buffer cache and drastically reduce I/O latency, all without blocking production reads or writes.