AWS SSM Error: Fixing 'TargetNotConnected' in Session Manager

There is perhaps no frustration in the AWS ecosystem quite like the "TargetNotConnected" error in Systems Manager (SSM). You have an EC2 instance. The status checks are green (2/2 passed). Security groups are locked down, as they should be. Yet when you attempt to start a session, the request is rejected:

"An error occurred (TargetNotConnected) when calling the StartSession operation: The target ... is not connected to SSM."

This error is misleading. It implies the instance is offline, but usually, the OS is running perfectly. The issue lies in the communication channel between the SSM Agent and the AWS control plane.

This guide provides a rigorous root cause analysis and a step-by-step technical fix for this specific error, moving beyond generic advice to infrastructure-level debugging.

The Root Cause: It’s a Pull, Not a Push

To fix this, you must understand how Session Manager works. Unlike SSH, where you open a port (22) and "push" a connection to the server, SSM is agent-based.

The SSM Agent running on your EC2 instance initiates the connection outbound. It acts as a worker polling the Systems Manager service for tasks.

For a session to be established, three strict conditions must be met simultaneously:

  1. Identity: The instance must have an IAM Role with permission to speak to the SSM API.
  2. Path: The instance must have a route to the public SSM endpoints (via an Internet Gateway or NAT Gateway) or to private ones (via VPC Endpoints).
  3. Handshake: The agent must be running and able to perform the TLS handshake.

If any of these links break, the API returns TargetNotConnected.
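Before working through the phases, it can help to ask Systems Manager whether the agent has ever checked in. The sketch below wraps aws ssm describe-instance-information; the function name ssm_ping_status is made up for illustration, and it assumes the AWS CLI on your workstation is configured for the instance's region.

```shell
# Print the SSM PingStatus for an instance: "Online", "ConnectionLost" or "Inactive".
# With --output text the CLI prints "None" when the instance is not in the list,
# i.e. it has never registered with Systems Manager.
ssm_ping_status() (
  instance_id="$1"
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=${instance_id}" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
)
```

Running ssm_ping_status i-0123456789abcdef0 and getting None means the instance has never registered; ConnectionLost means it registered at some point but has stopped checking in.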

Phase 1: Validating the IAM Role (The Most Common Culprit)

An EC2 instance cannot interact with SSM without an Instance Profile. A common mistake is attaching a role that has S3 permissions but lacks the core SSM policies.

The Fix

Ensure your EC2 instance has an IAM role attached containing the AmazonSSMManagedInstanceCore managed policy.

If you are using Terraform, your role definition should look like this. Note that the EC2 trust relationship is declared in assume_role_policy:

resource "aws_iam_role" "ssm_role" {
  name = "ec2-ssm-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm_profile" {
  name = "ec2-ssm-profile"
  role = aws_iam_role.ssm_role.name
}

Verification: Run this AWS CLI command locally to verify the instance actually has the profile attached:

aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0123456789abcdef0

If the response is empty, or the State field is anything other than associated, attach the instance profile to the instance.
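Attaching the profile from the CLI might look like the following. This is a minimal sketch: attach_ssm_profile is a hypothetical helper name, and the default profile name simply reuses ec2-ssm-profile from the Terraform example above.

```shell
# Attach an existing instance profile to a running instance.
# The second argument is optional; it defaults to the profile name used in the
# Terraform example (an assumption, substitute your own).
attach_ssm_profile() (
  instance_id="$1"
  profile_name="${2:-ec2-ssm-profile}"
  aws ec2 associate-iam-instance-profile \
    --instance-id "${instance_id}" \
    --iam-instance-profile "Name=${profile_name}"
)
```

After attach_ssm_profile i-0123456789abcdef0, the agent may take a while to pick up the new credentials, so do not expect the instance to register instantly.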

Phase 2: Network Reachability (The Silent Blocker)

If permissions are correct, the issue is almost certainly networking. The SSM Agent communicates over HTTPS (Port 443).

Scenario A: Public Subnet

If your instance is in a public subnet, verify the Security Group attached to the instance allows Outbound traffic on port 443 to 0.0.0.0/0.

Note: Inbound rules do not affect SSM. You can have empty Inbound rules and SSM will still work.
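If you have some other way onto the instance (the EC2 serial console, for example), you can test the outbound path directly. The sketch below assumes curl is installed and that the instance lives in the standard amazonaws.com partition; both function names are illustrative.

```shell
# Print the three regional hostnames the SSM Agent talks to.
ssm_endpoint_hosts() (
  region="$1"
  for svc in ssm ssmmessages ec2messages; do
    printf '%s.%s.amazonaws.com\n' "$svc" "$region"
  done
)

# Attempt an HTTPS request to each host on port 443.
# curl exits 0 as soon as it receives any HTTP response, so an error status
# code from the service still counts as "reachable" here.
check_ssm_egress() (
  region="$1"
  for host in $(ssm_endpoint_hosts "$region"); do
    if curl --silent --output /dev/null --connect-timeout 5 "https://${host}"; then
      echo "OK   ${host}"
    else
      echo "FAIL ${host}"
    fi
  done
)
```

Calling check_ssm_egress us-east-1 prints one line per endpoint; any FAIL line suggests a Security Group, NACL, or routing problem on the way out.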

Scenario B: Private Subnet (The "Hybrid" Trap)

This is where many engineers get stuck. If your instance is in a private subnet, it has no direct internet access. It cannot reach the public SSM API endpoints.

You have two solutions:

  1. NAT Gateway: Route traffic to a NAT Gateway in a public subnet.
  2. VPC Endpoints (Recommended): Keep traffic entirely within the AWS network.

If you use VPC Endpoints (PrivateLink), you need three interface endpoints in your VPC. Missing any one of them can cause the TargetNotConnected error.

  1. com.amazonaws.[region].ssm (The core API)
  2. com.amazonaws.[region].ec2messages (The message delivery channel)
  3. com.amazonaws.[region].ssmmessages (The Session Manager channel)

Here is the Terraform configuration to ensure these exist:

locals {
  services = {
    "ssm"          = "com.amazonaws.us-east-1.ssm"
    "ssmmessages"  = "com.amazonaws.us-east-1.ssmmessages"
    "ec2messages"  = "com.amazonaws.us-east-1.ec2messages"
  }
}

resource "aws_vpc_endpoint" "ssm_endpoints" {
  for_each            = local.services
  vpc_id              = var.vpc_id
  service_name        = each.value
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  
  # CRITICAL: The Security Group on the Endpoint must allow 
  # Inbound 443 from the EC2 Instance's subnet/IP.
  security_group_ids  = [aws_security_group.vpce_sg.id]

  private_dns_enabled = true
}

Critical Detail: Ensure private_dns_enabled is set to true. This ensures the SSM Agent's DNS lookup resolves to the internal VPC IP, not the public internet IP.
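One way to check the DNS behaviour described above is to resolve the service hostnames from inside the VPC and see whether the answers are private addresses. This is a rough sketch: it assumes dig is available, that your VPC uses RFC 1918 address space, and the function names are invented for the example.

```shell
# Succeed if the argument is an RFC 1918 private IPv4 address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve each SSM hostname and label every A record as private or PUBLIC.
check_ssm_dns() (
  region="$1"
  for svc in ssm ssmmessages ec2messages; do
    host="${svc}.${region}.amazonaws.com"
    # dig +short can also print CNAME targets; keep only IPv4 answers.
    for ip in $(dig +short "$host" A | grep -E '^[0-9.]+$'); do
      if is_private_ipv4 "$ip"; then
        echo "private ${host} ${ip}"
      else
        echo "PUBLIC  ${host} ${ip}"
      fi
    done
  done
)
```

If check_ssm_dns us-east-1 reports PUBLIC for a hostname while the endpoints exist, private DNS is probably not taking effect for that endpoint.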

Phase 3: The IMDSv2 Hop Limit Edge Case

If IAM and Networking are perfect, but the instance is running Docker or acts as a router/proxy, you might be hitting the Instance Metadata Service Version 2 (IMDSv2) hop limit.

The SSM Agent retrieves credentials from the metadata service at 169.254.169.254. If you have enforced IMDSv2 (which you should for security), the default HttpPutResponseHopLimit is 1.

If the agent is running inside a container, or if the OS forwards the request through an extra network hop (a bridge or proxy, for example), one hop is not enough. The token response never arrives, the agent gets no credentials, and the instance never registers.

The Fix

Increase the hop limit to 2 or 3 via the AWS CLI:

aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-put-response-hop-limit 2 \
    --http-endpoint enabled

You do not need to restart the instance for this to take effect.
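To check the current value before changing anything, or to confirm the change afterwards, something like the following may help. The helper names are hypothetical; the first function runs from your workstation, the second from inside the instance or container where the agent runs.

```shell
# Show the instance's metadata options (HttpTokens, HttpPutResponseHopLimit, ...).
show_metadata_options() (
  instance_id="$1"
  aws ec2 describe-instances \
    --instance-ids "${instance_id}" \
    --query 'Reservations[0].Instances[0].MetadataOptions' \
    --output json
)

# Request an IMDSv2 session token from the metadata service.
# An empty result or a timeout is consistent with the PUT response being
# dropped because the hop limit is too low.
imds_token() {
  curl --silent --max-time 3 -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}
```

If imds_token prints a token on the host but nothing inside the container, the hop limit is the likely culprit.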

Phase 4: Automated Diagnosis via AWS Support

If you still cannot connect, do not blindly reboot. Use the AWSSupport-TroubleshootManagedInstance automation document. This executes a specialized diagnostic flow that checks the instance from the "outside."

  1. Go to Systems Manager > Automation.
  2. Click Execute automation.
  3. Search for AWSSupport-TroubleshootManagedInstance.
  4. Input your Instance ID (i-xxxx).
  5. Execute.

This tool will query the VPC configuration, route tables, and IAM profile associations, returning a specific "Pass/Fail" for every requirement listed above.
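The same runbook can also be started from the CLI. Treat this as a sketch under assumptions: the function names are illustrative, and the InstanceId parameter name is what I expect the runbook to take, so confirm it against the document's parameter list in the console before relying on it.

```shell
# Start the troubleshooting runbook and print the execution ID.
# ASSUMPTION: the runbook's input parameter is named "InstanceId".
start_ssm_troubleshoot() (
  instance_id="$1"
  aws ssm start-automation-execution \
    --document-name "AWSSupport-TroubleshootManagedInstance" \
    --parameters "InstanceId=${instance_id}" \
    --query 'AutomationExecutionId' \
    --output text
)

# Print the overall status of an automation execution.
automation_status() (
  execution_id="$1"
  aws ssm get-automation-execution \
    --automation-execution-id "${execution_id}" \
    --query 'AutomationExecution.AutomationExecutionStatus' \
    --output text
)
```

A typical flow would be to capture the ID with id="$(start_ssm_troubleshoot i-0123456789abcdef0)" and then poll automation_status "$id" until it leaves the in-progress state.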

Summary

The TargetNotConnected error is rarely a random glitch; it almost always points to a missing piece of configuration.

  1. Check IAM: Ensure AmazonSSMManagedInstanceCore is attached.
  2. Check DNS/Network: Verify Outbound 443 or the presence of all three VPC Endpoints (ssm, ssmmessages, ec2messages).
  3. Check Metadata: If using containers, bump the IMDSv2 hop limit to 2.

By methodically validating the "Identity, Path, and Handshake," you can resolve connectivity issues without ever needing to open port 22.