An enterprise Salesforce integration built on an event-driven architecture often works flawlessly in sandbox environments. Once production loads scale, however, architectural cracks begin to show: silent data drops, unhandled external API timeouts, and the dreaded LIMIT_EXCEEDED errors when you hit hourly publishing limits.
When building resilient external integrations, treating Platform Events simply as asynchronous triggers is insufficient. You must explicitly manage batch processing, transient failures, and strict platform limits.
This guide details the technical root causes behind dropped messages and exhausted limits, providing a production-ready architectural pattern leveraging EventBus.RetryableException and Dead Letter Queues (DLQ).
Understanding the Failure Mechanisms
To prevent dropped events and limit breaches, you must understand how Salesforce processes High-Volume Platform Events under the hood.
1. The Hourly Publishing Limits Reality
Salesforce enforces strict Platform Event limits on both the publishing and the delivery (CometD/Pub/Sub API) sides. A common architectural flaw is treating event publishing like standard DML inside a loop. Each published event consumes one unit of your hourly allocation, and each EventBus.publish() call also counts as a DML statement. If a Bulk API update modifies 10,000 records and your trigger publishes an event for every row, you will rapidly exhaust the hourly allowance, causing subsequent publishes to fail across the org.
2. The Silent Drop of Automated Process Triggers
When an external integration fails (e.g., a REST API timeout or a database row lock), standard Apex exceptions terminate the transaction. Because Platform Event triggers run asynchronously under the Automated Process user, there is no synchronous UI layer to surface the error. The event batch is considered "processed" by the Event Bus, the Replay ID advances, and the message is permanently dropped.
To prevent this, Salesforce provides EventBus.RetryableException. It introduces a new risk, however: if you throw this exception blindly, the trigger is retried up to 9 times, and if the final retry also fails, Salesforce suspends the trigger. Its subscription moves to an Error state, and no further events are processed until an administrator resumes it.
The Solution: Checkpoint Resumption and Dead Letter Queues
A robust enterprise Salesforce integration must implement a three-tiered defense mechanism:
- Bulkified Publishing: Aggregate events before publishing.
- Targeted Retries: Use EventBus.RetryableException strictly for transient errors (e.g., timeouts, row locks), while using setResumeCheckpoint() to avoid reprocessing events that already succeeded within the batch.
- Dead Letter Queue (DLQ): Route events to a persistent storage object when the retry threshold is near its maximum or when non-transient errors (e.g., NullPointerExceptions) occur.
The Implementation
Below is a modern, production-ready Apex implementation demonstrating this pattern.
/**
 * @description Handler for Order_Integration__e Platform Event.
 * Implements Checkpoint Resumption, Transient Retries, and a DLQ.
 */
public inherited sharing class OrderIntegrationEventHandler {

    // Salesforce allows up to 9 retries (10 total executions).
    // We intercept at 8 so the final attempt can gracefully route to the DLQ.
    private static final Integer MAX_RETRIES = 8;

    public static void handle(List<Order_Integration__e> events) {
        EventBus.TriggerContext context = EventBus.TriggerContext.currentContext();
        Integer currentRetryCount = context.retries;
        List<Integration_Error_Log__c> dlqRecords = new List<Integration_Error_Log__c>();

        for (Integer i = 0; i < events.size(); i++) {
            Order_Integration__e evt = events[i];
            try {
                // Attempt the external system integration or complex DML
                processSingleIntegration(evt);
            } catch (System.CalloutException calloutEx) {
                // Transient error: external system is down or timing out
                handleTransientError(evt, calloutEx, currentRetryCount, dlqRecords);
            } catch (System.DmlException dmlEx) {
                if (dmlEx.getMessage().contains('UNABLE_TO_LOCK_ROW')) {
                    // Transient error: record is currently locked by another process
                    handleTransientError(evt, dmlEx, currentRetryCount, dlqRecords);
                } else {
                    // Fatal DML error (e.g., validation rule, missing required field)
                    dlqRecords.add(createDlqRecord(evt, dmlEx));
                }
            } catch (Exception ex) {
                // Fatal error: code-level issue (NPE, MathException).
                // Do not retry. Route immediately to the DLQ.
                dlqRecords.add(createDlqRecord(evt, ex));
            }
        }

        // Persist all fatal/exhausted errors to the Dead Letter Queue
        if (!dlqRecords.isEmpty()) {
            insert dlqRecords;
        }
    }

    /**
     * @description Handles transient errors by either setting a resume checkpoint
     * and throwing a RetryableException, or routing to the DLQ if retries are exhausted.
     */
    private static void handleTransientError(
        Order_Integration__e evt,
        Exception ex,
        Integer currentRetryCount,
        List<Integration_Error_Log__c> dlqRecords
    ) {
        if (currentRetryCount < MAX_RETRIES) {
            // Set the checkpoint to the current event.
            // The next retry execution will start specifically at this ReplayId.
            EventBus.TriggerContext.currentContext().setResumeCheckpoint(evt.ReplayId);
            // Halt the current batch execution and queue a retry.
            throw new EventBus.RetryableException(
                'Transient error encountered. Retrying from ReplayId: ' + evt.ReplayId
            );
        } else {
            // Retry limit exhausted. Swallow the error so the batch is acknowledged,
            // but persist the payload to the DLQ for manual replay.
            dlqRecords.add(createDlqRecord(evt, ex));
        }
    }

    private static void processSingleIntegration(Order_Integration__e evt) {
        // Implementation for callout or complex processing.
        // Note: in bulk scenarios you would aggregate these, but for demonstrating
        // Checkpoint Resumption, processing is evaluated per event.
    }

    private static Integration_Error_Log__c createDlqRecord(Order_Integration__e evt, Exception ex) {
        return new Integration_Error_Log__c(
            Event_Payload__c = JSON.serialize(evt),
            Error_Message__c = ex.getMessage(),
            Stack_Trace__c = ex.getStackTraceString(),
            Replay_Id__c = evt.ReplayId,
            Status__c = 'Requires Manual Intervention'
        );
    }
}
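For completeness, the handler is invoked from an after-insert trigger on the event object. A minimal sketch (the trigger name here is an assumption):

trigger OrderIntegrationTrigger on Order_Integration__e (after insert) {
    // Delegate to the handler class so the retry/DLQ logic stays unit-testable
    OrderIntegrationEventHandler.handle(Trigger.new);
}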
Deep Dive: Why This Architecture Works
1. The setResumeCheckpoint() Optimization
When an EventBus.RetryableException is thrown without a checkpoint, the Event Bus resets the pointer to the beginning of the original trigger batch. If your batch contains 2,000 events and event #1,999 fails with a timeout, a checkpoint-less retry forces Salesforce to reprocess all 1,998 successful events. This results in duplicate callouts and rapid consumption of external API rate limits.
By calling EventBus.TriggerContext.currentContext().setResumeCheckpoint(evt.ReplayId), you instruct the Event Bus to commit the successful processing of all preceding events and resume the next execution attempt strictly starting from the failed Replay ID.
2. Differentiating Transient vs. Fatal Errors
A common anti-pattern is wrapping the entire trigger execution in a generic try/catch and throwing a RetryableException for any error. If an event payload causes a NullPointerException, retrying it 9 times will not fix the data anomaly. It merely wastes CPU time and delays the processing of healthy events queued behind it.
The architecture above explicitly limits retries to CalloutException and UNABLE_TO_LOCK_ROW errors.
3. The 8-Retry Cutoff Strategy
Salesforce hard-caps retries at 9. If an exception is still thrown on the 9th retry (the 10th overall execution), the platform suspends the trigger: its subscription enters an Error state, and no further Apex executes until an administrator resumes the subscription. By intercepting the execution at context.retries == 8, the code gracefully absorbs the final failure, commits the failed payload to a persistent Integration_Error_Log__c custom object, and allows the Event Bus pointer to move forward.
Common Pitfalls and Edge Cases
Bulkifying Event Publishing
To avoid hitting Salesforce Platform Events limits, never call EventBus.publish() inside a for loop. Always populate a List<SObject> and publish the collection.
// ANTI-PATTERN: one EventBus.publish() call (and one DML statement) per record
for (Account acc : newAccounts) {
    EventBus.publish(new Account_Created__e(AccountId__c = acc.Id));
}

// OPTIMAL: a single EventBus.publish() call for the entire collection
List<Account_Created__e> eventsToPublish = new List<Account_Created__e>();
for (Account acc : newAccounts) {
    eventsToPublish.add(new Account_Created__e(AccountId__c = acc.Id));
}
EventBus.publish(eventsToPublish);
Callout Limits Inside Retry Loops
Standard synchronous Apex limits apply to Platform Event triggers. A single execution context can make a maximum of 100 callouts. If your batch size is 2,000 events and you perform a callout per event, you will hit the System.LimitException: Too many callouts error.
To handle high-volume event-driven architecture callouts, you must either:
- Aggregate payloads and perform a single bulk callout to the external system.
- Override the default Platform Event trigger batch size using the PlatformEventSubscriberConfig Metadata API type, reducing it from 2,000 to 100 or fewer.
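As a sketch, a deploy-time configuration for the second option might look like the following (the master label and trigger name are assumptions for this example):

<?xml version="1.0" encoding="UTF-8"?>
<PlatformEventSubscriberConfig xmlns="http://soap.sforce.com/2006/04/metadata">
    <masterLabel>Order Integration Config</masterLabel>
    <platformEventConsumer>OrderIntegrationTrigger</platformEventConsumer>
    <!-- Cap each trigger execution at 100 events so per-event callouts stay within the 100-callout limit -->
    <batchSize>100</batchSize>
</PlatformEventSubscriberConfig>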
Automated Process User Context
Because Platform Event triggers run as the Automated Process user, Integration_Error_Log__c records created by the DLQ will be owned by this system user. Ensure that your org-wide defaults (OWD) and sharing rules allow your integration teams to view and modify records owned by Automated Process; otherwise, your administrators will be unable to manually replay failed events.
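To manually replay a dead-lettered event, an administrator-run script can rehydrate the stored payload and republish it. The following is a sketch assuming the Integration_Error_Log__c fields defined earlier; note that system fields captured by JSON.serialize (such as ReplayId and EventUuid) are not writable and must be stripped before rebuilding the event, and the 'Replayed' status value is an assumption:

// Hedged sketch: replay one failed event from the DLQ
Integration_Error_Log__c log = [
    SELECT Id, Event_Payload__c
    FROM Integration_Error_Log__c
    WHERE Status__c = 'Requires Manual Intervention'
    LIMIT 1
];

// Strip read-only system fields before deserializing into a fresh event
Map<String, Object> raw = (Map<String, Object>) JSON.deserializeUntyped(log.Event_Payload__c);
raw.remove('attributes');
raw.remove('ReplayId');
raw.remove('EventUuid');

Order_Integration__e evt = (Order_Integration__e) JSON.deserialize(
    JSON.serialize(raw), Order_Integration__e.class
);
Database.SaveResult result = EventBus.publish(evt);
if (result.isSuccess()) {
    log.Status__c = 'Replayed';
    update log;
}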
Conclusion
Building a resilient Salesforce event-driven architecture requires anticipating failure. By explicitly tracking the TriggerContext.retries property, setting Replay ID checkpoints, and establishing a persistent Dead Letter Queue, you transform Platform Events from a volatile messaging layer into a highly reliable enterprise integration bus.