Saga Pattern Failure Modes: Handling Failed Compensating Transactions

You have implemented the Saga pattern to manage distributed transactions across your microservices. You successfully moved away from Two-Phase Commit (2PC) to improve availability. You have defined your transaction steps ($T_1, T_2, T_3$) and their corresponding compensating actions ($C_1, C_2, C_3$).

But here is the scenario that keeps distributed systems engineers up at night: What happens when $C_2$ fails?

$T_1$ (Order Created) succeeded. $T_2$ (Payment Captured) succeeded. $T_3$ (Allocate Inventory) failed.

The Saga coordinator initiates the rollback. It successfully executes $C_3$ (noop), but when it attempts $C_2$ (Refund Payment), the Payment Gateway returns a 504 Gateway Timeout or, worse, a 400 Bad Request.

Your system is now in a "Zombie" state. The customer has been charged, the inventory was not allocated, and the automated refund failed. The system is inconsistent, and standard rollback mechanisms have exhausted themselves.

The Root Cause: The Fallacy of Guara...
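
To make the failure scenario described above concrete, here is a minimal Python sketch of a coordinator rolling back completed steps in reverse order. It is an illustration under stated assumptions, not the post's implementation: `Step`, `CompensationError`, `is_transient`, and `roll_back` are hypothetical names, the gateway is simulated by a function that raises, and the rule "5xx may be retried, 4xx will not succeed on retry" is an assumed policy. Backoff between attempts is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class CompensationError(Exception):
    """Hypothetical error raised by a compensating action, with an HTTP-like status."""

    def __init__(self, status: int):
        super().__init__(f"compensation failed with status {status}")
        self.status = status


@dataclass
class Step:
    name: str
    compensate: Callable[[], None]


def is_transient(status: int) -> bool:
    # Assumption: 5xx responses (such as 504) may succeed on a retry,
    # while 4xx responses (such as 400) will keep failing.
    return 500 <= status < 600


def roll_back(completed: List[Step], max_attempts: int = 3) -> List[Tuple[str, int]]:
    """Compensate completed steps in reverse order.

    Returns (step name, last status) for every step whose compensation
    could not be completed, so the caller can escalate it.
    """
    stuck: List[Tuple[str, int]] = []
    for step in reversed(completed):
        for attempt in range(1, max_attempts + 1):
            try:
                step.compensate()
                break  # compensation succeeded, move to the previous step
            except CompensationError as err:
                if not is_transient(err.status) or attempt == max_attempts:
                    stuck.append((step.name, err.status))
                    break  # give up on this step and record it


    return stuck


def refund_payment() -> None:
    # Simulated Payment Gateway rejecting the refund outright.
    raise CompensationError(400)


stuck = roll_back([
    Step("T1 Order Created", lambda: None),
    Step("T2 Payment Captured", refund_payment),
])
print(stuck)  # [('T2 Payment Captured', 400)]
```

The non-empty return value is the "Zombie" state in miniature: the rollback has finished running, yet the captured payment is still unrefunded. This sketch keeps compensating earlier steps after one fails; whether to continue or halt at that point is a design choice the sketch does not settle.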