One of the most jarring transitions for engineers moving from EC2 to AWS Fargate is the loss of direct server access. When a process hangs or a configuration file behaves unexpectedly, you can no longer simply ssh into the host. There is no host.
For a long time, debugging Fargate tasks required cumbersome workarounds involving sidecars or reverting to EC2-backed ECS. However, the introduction of ECS Exec solved this by leveraging AWS Systems Manager (SSM).
This guide details exactly how to implement ECS Exec to gain an interactive shell inside your serverless containers, covering the IAM requirements, infrastructure changes, and networking nuances required for production environments.
The Architecture: Why Standard SSH Fails
To solve this problem, we must understand the abstraction. In a standard EC2 environment, you control the network interface (ENI) and the operating system. You install an SSH daemon, manage keys, and open port 22.
In Fargate, AWS manages the underlying OS and your container is the boundary. There is no SSH daemon running, and you have no way to install keys on or mount volumes into the host OS.
ECS Exec bypasses the need for SSH entirely. It works by injecting the SSM Agent (a binary used by Systems Manager) into your container alongside your application code. This agent establishes a control channel with the Systems Manager service. When you run a command from your local CLI, the request is routed through the SSM API, down to the SSM Agent inside your container, which then spawns a shell process.
This approach offers distinct security advantages:
- No open inbound ports (port 22 remains closed).
- Access is controlled via IAM policies, not shared PEM keys.
- All sessions are logged and auditable via CloudTrail.
Prerequisites: The Tooling
Before configuring the cloud infrastructure, ensure your local environment is equipped. You cannot use a standard SSH client for this.
- AWS CLI v2: Ensure you are running a recent version.
- Session Manager Plugin: The AWS CLI delegates the actual data streaming to this plugin.
# Verify installation
session-manager-plugin --version
If this command returns an error, install the plugin from the AWS official documentation.
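It is also worth double-checking the CLI itself, since this guide assumes a reasonably current AWS CLI v2 build:
# Confirm the AWS CLI major version (should report aws-cli/2.x)
aws --version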
Step 1: Configuring IAM Permissions
The most common point of failure is permissions. Fargate tasks utilize two distinct roles:
- Task Execution Role: Used by the ECS agent (pulling images, sending logs).
- Task Role: Used by the container itself (accessing S3, DynamoDB, SSM).
For ECS Exec to work, the Task Role must have permissions to communicate with the SSM service.
Add the following policy statement to your ECS Task Role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
Note: The AWS managed policy arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy belongs on the Execution Role and grants no SSM access. Whether you manage IAM via Terraform or the console, the ssmmessages permissions above must be added to the Task Role.
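If you want to attach the statement quickly without touching IaC, one option is an inline policy via the CLI. The role name, policy name, and file path below are placeholders, and the JSON above is assumed to be saved as ecs-exec-ssm.json:
# Attach the SSM channel permissions as an inline policy on the Task Role
# (role name, policy name, and file path are placeholders for your environment)
aws iam put-role-policy \
  --role-name my-app-task-role \
  --policy-name ecs-exec-ssm-channels \
  --policy-document file://ecs-exec-ssm.json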
Step 2: Enabling ECS Exec on the Service
The SSM agent is not injected by default, to save resources. You must explicitly enable the enableExecuteCommand flag on your ECS service (or pass --enable-execute-command to run-task for standalone tasks); it is not a Task Definition setting.
Option A: Using Terraform (Infrastructure as Code)
If you manage your stack via Terraform, modify your aws_ecs_service resource:
resource "aws_ecs_service" "api" {
name = "production-api"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.main.arn
desired_count = 2
launch_type = "FARGATE"
# CRITICAL: This enables the SSM sidecar injection
enable_execute_command = true
network_configuration {
subnets = module.vpc.private_subnets
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
}
Option B: Using AWS CLI
If you need to debug a running service immediately without a deployment pipeline, update the service via CLI:
aws ecs update-service \
--cluster my-cluster-name \
--service my-service-name \
--enable-execute-command \
--force-new-deployment
Important: The --force-new-deployment flag is mandatory. The SSM agent is only injected when the container starts. Existing running tasks will not be accessible until they are replaced by new tasks.
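Before waiting on the rollout, it is worth confirming that the flag actually landed on the service. One way to do that, using the same placeholder names as above:
# Should print "true" once the service update has been applied
aws ecs describe-services \
  --cluster my-cluster-name \
  --services my-service-name \
  --query 'services[0].enableExecuteCommand'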
Step 3: Shelling into the Container
Once the deployment rolls out and the new tasks are RUNNING, retrieve the Task ID.
# List tasks to get the ID
aws ecs list-tasks --cluster my-cluster-name
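Because only tasks launched after the update carry the agent, you can also verify the flag on the specific task you plan to target (the task ID is a placeholder):
# "true" means this task was launched with ECS Exec enabled
aws ecs describe-tasks \
  --cluster my-cluster-name \
  --tasks <YOUR_TASK_ID> \
  --query 'tasks[0].enableExecuteCommand'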
Execute the interactive command. The example below launches /bin/sh, which is present in most images; substitute /bin/bash if your image includes it.
aws ecs execute-command \
--cluster my-cluster-name \
--task <YOUR_TASK_ID> \
--container <YOUR_CONTAINER_NAME> \
--interactive \
--command "/bin/sh"
If successful, your terminal prompt will change, indicating you are now root inside the Fargate container.
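The same mechanism works for one-off commands instead of a full shell. The path below is illustrative, and the --interactive flag is still required because ECS Exec currently only supports interactive sessions:
# Run a single command, stream its output, then exit
aws ecs execute-command \
  --cluster my-cluster-name \
  --task <YOUR_TASK_ID> \
  --container <YOUR_CONTAINER_NAME> \
  --interactive \
  --command "ls -la /app"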
Troubleshooting & Common Pitfalls
If you receive an error like The Session Manager plugin was not found or the command simply hangs, check these specific root causes.
1. Network Connectivity (The Silent Killer)
The SSM Agent inside the container must be able to reach AWS Systems Manager endpoints (specifically ssmmessages).
If your Fargate tasks are in private subnets (which they should be), they cannot reach the internet directly. You have two options:
- NAT Gateway: Ensure the private route table routes 0.0.0.0/0 to a NAT Gateway.
- VPC Endpoints (PrivateLink): If you have no NAT Gateway for security or cost reasons, you must provision VPC Endpoints for:
  - com.amazonaws.<region>.ssmmessages
  - com.amazonaws.<region>.ecr.dkr (for pulling images)
  - com.amazonaws.<region>.logs (for CloudWatch)
Without a path to ssmmessages, the agent starts but cannot register itself, rendering the shell inaccessible.
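As a sketch, the ssmmessages interface endpoint can be created like this; the VPC, subnet, and security group IDs and the region are placeholders, and the endpoint's security group must allow inbound HTTPS (443) from your tasks:
# Create an interface endpoint so private tasks can reach ssmmessages
# (all IDs and the region below are placeholders for your environment)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.eu-west-1.ssmmessages \
  --subnet-ids subnet-0aaa111 subnet-0bbb222 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled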
2. KMS Encryption Issues
If you have configured ECS Exec logging to use KMS encryption (for auditing session logs in S3 or CloudWatch), the Task Role requires kms:Decrypt permissions for that specific key. If the role cannot decrypt the session key, the connection terminates immediately.
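To check whether a KMS key is configured for your sessions (and therefore whether the Task Role needs kms:Decrypt on it), you can inspect the cluster's execute-command configuration; the cluster name is a placeholder:
# Shows the KMS key and session-logging settings, if any, used by ECS Exec
aws ecs describe-clusters \
  --clusters my-cluster-name \
  --include CONFIGURATIONS \
  --query 'clusters[0].configuration.executeCommandConfiguration'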
3. Init Process & Zombie Processes
When you run commands interactively, you might generate zombie processes if the container's entrypoint isn't an init system (like Tini).
To mitigate this, enable the initProcessEnabled flag in your Task Definition. This runs a Tini-style init process as PID 1, ensuring proper signal handling and zombie reaping.
# Terraform Example
resource "aws_ecs_task_definition" "main" {
  family = "service"

  container_definitions = jsonencode([
    {
      name  = "app"
      image = "my-app:latest"
      # ...
      linuxParameters = {
        initProcessEnabled = true
      }
    }
  ])
}
Automating Verification
AWS provides a script to verify all prerequisites (IAM, Networking, Agent Status) for a specific task. This is invaluable for debugging "why" a connection fails.
Run this locally to diagnose connection issues:
# Download the official checker script
curl -O https://raw.githubusercontent.com/aws-containers/amazon-ecs-exec-checker/main/check-ecs-exec.sh
# Make executable
chmod +x check-ecs-exec.sh
# Run against your task (cluster name and task ID are positional arguments)
./check-ecs-exec.sh my-cluster <TASK_ID>
Conclusion
ECS Exec transforms the Fargate debugging experience from a "black box" frustration into a manageable, transparent process. By leveraging SSM Session Manager, you gain secure, audited, and keyless access to your containers without compromising the serverless nature of Fargate.
Remember that while powerful, this feature should primarily be used for debugging. Avoid using execute-command for operational tasks like database migrations or configuration updates; those belong in your deployment pipeline or dedicated initialization tasks.