Intermittent 502 Bad Gateway errors are among the most frustrating issues to debug in a production environment. Your application traffic is flowing normally, but occasionally, HTTP requests fail with a generic 502. You check your application logs, only to find no trace of the failed request.
This happens because the failure occurred at the infrastructure or runtime boundary, before your Node.js application could log the transaction. To effectively troubleshoot Azure Web App 502 errors, you must look beyond your application code and understand the Azure platform architecture bridging your traffic.
This guide explores the architectural root causes of these errors and provides concrete, modern solutions to stabilize your Node.js deployments.
The Architecture Behind Azure App Service 502 Bad Gateway
Azure App Service fronts your application with Application Request Routing (ARR), a reverse-proxy layer that routes incoming external traffic to the worker instances hosting your application.
A 502 Bad Gateway error specifically indicates that the ARR attempted to proxy a request to your Node.js worker instance but received an invalid response, a connection refusal, or a dropped TCP connection.
For Node.js applications, this disconnect between the ARR and the worker typically stems from three root causes:
- Node.js Azure SNAT port exhaustion: The worker runs out of available outbound TCP ports, causing background connection attempts to fail, blocking the event loop, and cascading into inbound ARR request timeouts.
- Process Crashes and Container Restarts: An unhandled exception terminates the Node.js process. While the container supervisor restarts the process, the ARR continues routing traffic to a dead socket, resulting in a 502.
- Event Loop Starvation: Synchronous CPU-bound operations block the Node.js event loop. The internal health probes fail to get a response within the timeout window, prompting Azure to drop the connection.
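The third failure mode is easy to reproduce locally. In this sketch, a synchronous busy-wait stands in for CPU-bound work such as parsing a huge JSON payload; while it spins, nothing else on the loop can run:

```javascript
// Illustration of event loop starvation: a synchronous busy-wait stands in
// for CPU-bound work such as parsing a very large JSON payload.
function blockEventLoop(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // While this loop spins, no timers, I/O callbacks, or health-probe
    // responses can run -- the event loop is fully starved.
  }
}

// Schedule a 10ms timer, then block the loop for 200ms:
const scheduledAt = Date.now();
setTimeout(() => {
  console.log(`10ms timer fired after ${Date.now() - scheduledAt}ms`);
}, 10);
blockEventLoop(200);
```

The timer cannot fire until the busy-wait returns, which is exactly what happens to ARR's health probes when a request handler blocks.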
The Fix: Resolving SNAT Port Exhaustion
In Azure PaaS networking, outbound connections to public IP addresses (such as external APIs or managed databases) undergo Source Network Address Translation (SNAT). Azure App Service pre-allocates a limited number of SNAT ports per worker instance (128 by default).
Not every Node.js HTTP client pools connections by default. If your application handles high concurrency and opens a new TCP connection for every external API call, you will rapidly exhaust your SNAT ports.
Implementing Global Connection Pooling
To fix this, you must explicitly configure your HTTP agents and database clients to reuse TCP connections. In modern Node.js (v18+), native fetch uses undici under the hood; registering a custom global dispatcher lets you extend its keep-alive window and cap the connection pool.
Here is the implementation for a modern Node.js service using Express and native fetch with an optimized connection pool:
import express from 'express';
import { Agent, setGlobalDispatcher } from 'undici';
import https from 'https';

// 1. Configure undici for native fetch connection pooling.
// This prevents SNAT exhaustion by reusing TCP connections.
const customDispatcher = new Agent({
  keepAliveTimeout: 60000, // 60 seconds
  keepAliveMaxTimeout: 60000,
  connections: 100, // Max sockets per origin
  pipelining: 1 // 1 = keep-alive without pipelining; 0 would disable keep-alive entirely
});
setGlobalDispatcher(customDispatcher);

// 2. Configure a global HTTPS agent for legacy SDKs that still use the
// classic http/https API (e.g., older Azure SDKs, axios via its httpsAgent option)
const globalHttpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000
});

const app = express();

app.get('/api/data', async (req, res, next) => {
  try {
    // This native fetch now utilizes the undici Agent with Keep-Alive
    const response = await fetch('https://api.external-service.com/data');
    if (!response.ok) {
      throw new Error(`External API responded with ${response.status}`);
    }
    const data = await response.json();
    res.json(data);
  } catch (error) {
    next(error);
  }
});

const PORT = process.env.PORT || 8080;
app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});
By enforcing keepAlive: true and limiting the maxSockets, you cap the maximum number of SNAT ports your application can consume, forcing the runtime to wait for an available socket rather than blindly requesting new ones from the OS.
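One subtlety worth calling out: defining a pooled agent changes nothing until legacy clients are told to use it. A minimal sketch of wiring it up (axios's documented httpsAgent option is shown in a comment; adjust the option name for your particular SDK):

```javascript
import https from 'node:https';

// One shared keep-alive agent for the whole process caps how many outbound
// sockets (and therefore SNAT ports) this instance can consume.
const keepAliveAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 100,   // hard ceiling on concurrent sockets
  maxFreeSockets: 10 // idle sockets kept warm for reuse
});

// Attach it explicitly -- constructing the agent alone does nothing:
//   const client = axios.create({ httpsAgent: keepAliveAgent });
//   https.get(url, { agent: keepAliveAgent }, callback);
console.log(keepAliveAgent.options.keepAlive); // true
```

Once every outbound client shares one agent, the maxSockets ceiling applies process-wide instead of per-library.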
The Fix: Graceful Shutdowns and Crash Handling
If your 502 errors align with memory spikes or unhandled promise rejections, your Node.js process is likely restarting mid-flight. When the process crashes, the OS abruptly severs active TCP connections. The ARR observes this severed connection and returns a 502 to the end user.
You must handle terminal signals and uncaught exceptions to ensure existing requests finish processing before the container restarts.
Implementing Graceful Teardown
Modify your application entry point to intercept termination signals from the Azure App Service container lifecycle:
import express from 'express';
import { createServer } from 'http';

const app = express();
const server = createServer(app);

app.get('/health', (req, res) => res.status(200).send('Healthy'));

const PORT = process.env.PORT || 8080;
server.listen(PORT, () => {
  console.log(`Application started on port ${PORT}`);
});

// Handle graceful shutdown
const gracefulShutdown = (signal) => {
  console.log(`Received ${signal}, starting graceful shutdown...`);

  // 1. Stop accepting new connections; let in-flight requests finish
  server.close(() => {
    console.log('HTTP server closed.');
    // 2. Close database connections here (e.g., mongoose.connection.close())
    // 3. Exit the process safely
    process.exit(0);
  });

  // Force shutdown if graceful teardown takes too long
  // (Azure allows roughly 30 seconds before killing the container)
  setTimeout(() => {
    console.error('Could not close connections in time, forcefully shutting down');
    process.exit(1);
  }, 10000).unref();
};

process.on('SIGINT', () => gracefulShutdown('SIGINT'));
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));

// After an uncaught error the process state is unreliable:
// log it, then drain in-flight requests and exit rather than continuing.
process.on('uncaughtException', (error) => {
  console.error('Uncaught Exception:', error);
  gracefulShutdown('uncaughtException');
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
  gracefulShutdown('unhandledRejection');
});
Deep Dive: How Azure Routes Node.js Traffic
Understanding the routing topology clarifies why these solutions are necessary. In a Linux Azure App Service plan, your code runs inside a Docker container.
The path of a request looks like this: Client -> Azure Load Balancer -> ARR -> Worker Node -> Docker Container -> Node.js Process
When you see a 502, the failure occurred specifically between the ARR and the Docker Container.
If your application takes longer than 230 seconds to respond, the Azure Load Balancer drops the connection (this is a hardcoded Azure idle timeout limit). If your Node.js event loop is blocked parsing a massive JSON payload, the Node.js process cannot respond to the ARR's internal TCP health checks. The ARR assumes the container is dead, tears down the route, and returns a 502.
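If heavy payload parsing is the culprit, move it off the main thread so the event loop stays free to answer probes. Here is a sketch using node:worker_threads with an inline worker (the helper name parseJsonOffThread is illustrative, not a library API):

```javascript
import { Worker } from 'node:worker_threads';

// Parse large JSON off the main thread so the event loop keeps servicing
// health probes and inbound requests while the parse runs.
function parseJsonOffThread(jsonText) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(
      `const { parentPort, workerData } = require('node:worker_threads');
       parentPort.postMessage(JSON.parse(workerData));`,
      { eval: true, workerData: jsonText }
    );
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

const parsed = await parseJsonOffThread('{"items": [1, 2, 3]}');
console.log(parsed.items.length); // 3
```

Note that posting the parsed result back incurs a structured-clone copy, so measure the tradeoff on your actual payload sizes before adopting this wholesale.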
Common Pitfalls and Edge Cases
1. The WEBSITES_PORT Misconfiguration
By default, Azure App Service probes ports 80 and 8080 to find your listening Node.js application. If your application binds to port 3000 but you do not explicitly set the WEBSITES_PORT=3000 app setting, the container will start, but the ARR will fail to route traffic to it, resulting in continuous 502/503 errors during cold starts. Always bind to process.env.PORT.
2. "Always On" Not Enabled
If your App Service is on a Basic tier or higher, ensure Always On is toggled to true in your Configuration settings. If this is disabled, Azure will unload your application worker process after 20 minutes of inactivity. The next incoming request will trigger a cold start. If your Node.js startup sequence takes more than a few seconds (e.g., fetching secrets, compiling runtime templates), the ARR will time out before the container fully spins up.
3. Ignoring Azure Diagnostics Logs
Application Insights will not always catch a 502 error because the Node.js APM agent never receives the request. To properly troubleshoot Azure Web App 502 issues, you must enable App Service Logs (specifically HTTP logs and Docker container logs). Navigate to Diagnose and solve problems -> HTTP 5xx Errors in the Azure Portal. This blade aggregates the ARR logs and will tell you if the 502 was caused by an ARR_TIMEOUT or a WIN32_ERROR.
Conclusion
Intermittent 502 Bad Gateway errors in Node.js on Azure are rarely caused by flaws in your application logic; they are infrastructure boundaries reacting to poor resource management. By strictly enforcing outbound connection pooling to prevent SNAT port exhaustion, implementing proper lifecycle management to handle process termination, and understanding the ARR timeout limits, you can effectively eliminate these errors and stabilize your production workloads.