You have migrated a critical workflow to MongoDB 5.0 or later, leveraged multi-document ACID transactions across a sharded cluster, and now your logs are flooded with this:
{
"code": 112,
"codeName": "WriteConflict",
"errorLabels": ["TransientTransactionError"],
"errmsg": "WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."
}
In a high-throughput environment, simply "retrying" without a strategy results in a thundering herd problem, spiking CPU usage and lock contention. This post details the mechanics of WiredTiger conflicts and provides a production-grade implementation for handling them in Node.js/TypeScript.
The Root Cause: WiredTiger Snapshot Isolation
To fix the error, you must understand the storage engine. MongoDB uses WiredTiger, which employs Multi-Version Concurrency Control (MVCC) using Snapshot Isolation.
When you start a transaction (at time $T_1$), MongoDB creates a snapshot of the data. Your transaction reads from this snapshot.
- Read Phase: You read Document A.
- Logic Phase: Your application calculates a new value.
- Commit Phase: You attempt to write the new value to Document A.
If another thread modifies Document A between $T_1$ and your commit attempt, WiredTiger detects a conflict. It cannot merge the changes safely because your logic was based on stale data (the version at $T_1$). Consequently, it aborts the transaction and throws a WriteConflict.
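To make the mechanism concrete, the following sketch provokes a WriteConflict deliberately by opening two transactions against the same document. It is illustrative only: the demo database, counters collection, and document _id are assumptions rather than anything from this post, and it presumes a replica set or sharded deployment, since transactions require one.
import { MongoClient, MongoServerError } from 'mongodb';

interface Counter {
  _id: string;
  n: number;
}

async function demonstrateWriteConflict(client: MongoClient): Promise<void> {
  const coll = client.db('demo').collection<Counter>('counters');
  await coll.updateOne({ _id: 'hot' }, { $setOnInsert: { n: 0 } }, { upsert: true });

  const s1 = client.startSession();
  const s2 = client.startSession();
  try {
    s1.startTransaction();
    s2.startTransaction();

    // Transaction 1 modifies the document and keeps its snapshot open.
    await coll.updateOne({ _id: 'hot' }, { $inc: { n: 1 } }, { session: s1 });

    // Transaction 2 touches the same document before T1 commits. WiredTiger
    // cannot reconcile the two versions and aborts this write with code 112.
    await coll.updateOne({ _id: 'hot' }, { $inc: { n: 1 } }, { session: s2 });
  } catch (err) {
    if (err instanceof MongoServerError && err.hasErrorLabel('TransientTransactionError')) {
      console.log('Expected WriteConflict, code:', err.code); // 112
    }
  } finally {
    await s1.commitTransaction().catch(() => {});
    await s2.abortTransaction().catch(() => {});
    await s1.endSession();
    await s2.endSession();
  }
}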
In Sharded Clusters, this is exacerbated. Transactions utilizing Two-Phase Commit (2PC) across shards introduce network latency between the prepare and commit phases, significantly widening the window of time in which a concurrent write can trigger a conflict.
The Fix: Idempotent Retries and Core API
Handling WriteConflict requires two layers:
- Driver-Level Retry Logic: Utilizing the driver's built-in TransientTransactionError handling.
- Application-Level Idempotency: Ensuring that retrying the logic doesn't duplicate side effects.
Do not write your own while(true) retry loops. The MongoDB Node.js driver provides a robust API (session.withTransaction) that handles backoff and timeout logic specifically for WriteConflict.
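For context, this is roughly what the helper saves you from writing yourself; a simplified sketch for illustration only, not the driver's actual implementation (the real helper also retries the commit separately on UnknownTransactionCommitResult and enforces an overall time limit):
import { ClientSession, MongoError } from 'mongodb';

// Simplified illustration of the retry loop session.withTransaction() runs for you.
async function runWithRetry(
  session: ClientSession,
  txnFn: (session: ClientSession) => Promise<void>
): Promise<void> {
  while (true) {
    session.startTransaction();
    try {
      await txnFn(session);
      await session.commitTransaction();
      return;
    } catch (err) {
      await session.abortTransaction().catch(() => {});
      // WriteConflict (code 112) carries the TransientTransactionError label,
      // which means the whole callback is safe to re-run from scratch.
      if (err instanceof MongoError && err.hasErrorLabel('TransientTransactionError')) {
        continue;
      }
      throw err;
    }
  }
}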
Production Implementation (TypeScript)
The following code demonstrates a robust inventory reservation system handling high concurrency.
import { MongoClient, ClientSession, ObjectId, TransactionOptions } from 'mongodb';
// Configuration interface
interface InventoryItem {
_id: ObjectId;
sku: string;
quantity: number;
reserved: number;
}
const uri = process.env.MONGODB_URI || 'mongodb://localhost:27017';
const client = new MongoClient(uri, { minPoolSize: 10, maxPoolSize: 100 });
/**
* Executes a business operation within a transaction with automatic retries
* for WriteConflict and TransientTransactionError.
*/
async function reserveInventory(sku: string, amount: number): Promise<void> {
const session: ClientSession = client.startSession();
const transactionOptions: TransactionOptions = {
readPreference: 'primary',
readConcern: { level: 'local' }, // 'snapshot' is strictly stronger but higher latency
writeConcern: { w: 'majority' },
};
try {
// withTransaction automatically handles "TransientTransactionError"
// and retries the entire callback if a WriteConflict occurs.
await session.withTransaction(async () => {
const db = client.db('ecommerce');
const inventoryColl = db.collection<InventoryItem>('inventory');
const auditColl = db.collection('audit_logs');
// 1. READ: Get current state
// IMPORTANT: Pass the session to every operation
const item = await inventoryColl.findOne(
{ sku: sku },
{ session }
);
if (!item) {
throw new Error(`Item ${sku} not found`);
}
// 2. LOGIC: Check constraints based on the SNAPSHOT read
if (item.quantity - item.reserved < amount) {
// This aborts the transaction immediately
throw new Error(`Insufficient stock for ${sku}`);
}
// 3. WRITE: Update the document
// Note: We are strictly relying on the read snapshot.
const updateResult = await inventoryColl.updateOne(
{ _id: item._id },
{ $inc: { reserved: amount } },
{ session }
);
if (updateResult.modifiedCount === 0) {
// Rare in this logic (we matched on an _id we just read), but guard anyway.
// Throwing a plain Error aborts the transaction and surfaces the failure to
// the caller; withTransaction only retries errors carrying the
// TransientTransactionError label.
throw new Error(`Failed to reserve inventory for ${sku}`);
}
// 4. WRITE: Audit side-effect inside the same transaction
// Because the insert is part of the transaction, an aborted and retried
// attempt leaves no orphaned audit row. Tagging the entry with the session
// id (or a request-scoped tracing id) lets downstream consumers correlate
// or de-duplicate it, e.g. via a unique index.
await auditColl.insertOne(
{
sku,
action: 'RESERVE',
amount,
timestamp: new Date(),
transactionId: session.id // Or a request-specific tracing ID
},
{ session }
);
}, transactionOptions);
console.log('Transaction successfully committed.');
} catch (error: any) {
// If we reach here, retries were exhausted or a non-transient error occurred
console.error('Transaction failed:', error.message);
throw error;
} finally {
await session.endSession();
}
}
// Usage
await client.connect();
await reserveInventory('PROD-123', 5);
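To exercise the retry path under contention, a small sketch (assuming the client is already connected, as above) fires several reservations against the same SKU at once; without the helper, most of them would surface code 112 to the caller:
// Five concurrent reservations competing for the same document.
const results = await Promise.allSettled(
  Array.from({ length: 5 }, () => reserveInventory('PROD-123', 1))
);
results.forEach((r, i) =>
  console.log(`reservation ${i}:`, r.status === 'fulfilled' ? 'ok' : r.reason.message)
);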
Why This Solution Works
1. The withTransaction Helper
The standard session.withTransaction wrapper is critical. It detects the TransientTransactionError label: when WiredTiger throws code 112 (WriteConflict), the driver recognizes the label and re-executes the provided callback function. Retries are bounded by an overall time limit (120 seconds in the Node.js driver), so a persistently contended document eventually surfaces the error to the caller instead of spinning forever.
2. Snapshot Isolation Compliance
In the code above, the read (findOne) and the write (updateOne) happen inside the callback. When a retry occurs, the read is performed again, against a fresh snapshot. This is vital. If you read the inventory outside the transaction and passed the value in, every retry would re-apply a write whose stock check was made against stale data, so the transaction could commit while silently violating the constraint it was supposed to enforce (see the anti-pattern sketch below).
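For contrast, here is a sketch of the broken shape, reusing the client and InventoryItem interface from above; it is illustrative only, not part of the original system:
// Anti-pattern: the read and the stock check happen OUTSIDE the transaction.
async function reserveInventoryBroken(sku: string, amount: number): Promise<void> {
  const coll = client.db('ecommerce').collection<InventoryItem>('inventory');

  const item = await coll.findOne({ sku }); // read outside the transaction
  if (!item || item.quantity - item.reserved < amount) {
    throw new Error(`Insufficient stock for ${sku}`);
  }

  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      // On retry only this write re-runs; 'item' is never re-read, so the
      // stock check above may no longer hold when this $inc is applied.
      await coll.updateOne({ _id: item._id }, { $inc: { reserved: amount } }, { session });
    });
  } finally {
    await session.endSession();
  }
}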
3. Read Concern: local vs snapshot
In sharded clusters, readConcern: 'snapshot' guarantees a consistent point-in-time view across shards but comes with higher latency. For high-velocity updates on single documents (like inventory counters), readConcern: 'local' inside the transaction is often sufficient, because the storage engine's write conflict detection still protects the read-modify-write cycle, and it avoids the overhead of coordinating a cluster-wide snapshot.
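For comparison, the stricter configuration would look like this; the variable name is illustrative:
import { TransactionOptions } from 'mongodb';

// A cluster-wide point-in-time snapshot: use when one transaction reads
// related data from several shards and those reads must be mutually consistent.
const crossShardOptions: TransactionOptions = {
  readPreference: 'primary',
  readConcern: { level: 'snapshot' },
  writeConcern: { w: 'majority' },
};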
Architectural Mitigation: The Pattern Fix
If code-level retries handle the error, architectural changes prevent it. If a single document is a "hot" write target (e.g., a global counter or a popular product), retries will eventually exhaust the timeout limit (default 120s).
To resolve persistent WriteConflicts in sharded clusters, apply the Bucket Pattern (collision reduction).
Instead of one document per SKU, create $N$ documents per SKU and distribute the writes randomly.
// Schema Redesign Concept
interface InventoryBucket {
sku: string;
bucketId: number; // 0 to 9
quantity: number;
}
// Optimized update logic (reuses the client and 'ecommerce' database from above)
async function incrementWithBuckets(sku: string, amount: number, session: ClientSession) {
const randomBucket = Math.floor(Math.random() * 10); // pick one of the 10 buckets at random
await client.db('ecommerce').collection<InventoryBucket>('inventory_buckets').updateOne(
{ sku: sku, bucketId: randomBucket },
{ $inc: { quantity: amount } },
{ session, upsert: true }
);
}
By splitting the write target into 10 buckets, two concurrent writers collide only when they happen to pick the same bucket, roughly a 10% chance instead of a certainty, so the transaction retry logic succeeds on the first or second attempt rather than the fiftieth.
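The trade-off the pattern introduces is on the read side: the per-SKU total is now the sum across buckets. A minimal sketch of that read, reusing the client and the InventoryBucket interface from above:
// Sum the quantity of every bucket belonging to one SKU.
async function getAvailableQuantity(sku: string): Promise<number> {
  const result = await client
    .db('ecommerce')
    .collection<InventoryBucket>('inventory_buckets')
    .aggregate<{ _id: string; total: number }>([
      { $match: { sku } },
      { $group: { _id: '$sku', total: { $sum: '$quantity' } } },
    ])
    .toArray();
  return result[0]?.total ?? 0;
}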
Conclusion
WriteConflict errors in MongoDB are not failures of the database; they are the mechanism by which ACID compliance is enforced in an optimistic concurrency model.
To solve them:
- Use session.withTransaction to automate retries.
- Ensure all reads that inform writes are contained within the transaction callback.
- If conflicts persist, redesign the schema to reduce write contention on individual documents using the Bucket Pattern.