
Fixing 'Cannot access /dev/kfd' Docker Errors for AMD ROCm Containers

Scaling AI workloads requires predictable containerization. While the Nvidia ecosystem has a well-documented path through the NVIDIA Container Toolkit, engineering teams deploying MLOps pipelines on AMD hardware often run into hardware-mapping roadblocks.

When initializing a ROCm-based container for frameworks like PyTorch or TensorFlow, you will likely encounter the fatal RuntimeError: Cannot access /dev/kfd or a generic hipErrorNoDevice exception. This failure halts the initialization of the HIP (Heterogeneous-compute Interface for Portability) runtime, rendering the AMD GPU inaccessible to the containerized application.

To resolve this, we must bypass Docker's default device cgroup restrictions and directly map the kernel interfaces ROCm uses to communicate with the physical hardware.

The Root Cause: Understanding /dev/kfd and /dev/dri

Docker isolates containers using Linux namespaces and cgroups. By default, a container cannot access hardware devices on the host operating system.

The AMD ROCm stack relies on two critical kernel interfaces exposed under /dev:

  1. /dev/kfd (Kernel Fusion Driver): This is the compute dispatch interface. It manages the HSA (Heterogeneous System Architecture) queues and handles memory management between the CPU and the GPU compute units. If a container cannot access this, compute operations fail immediately.
  2. /dev/dri (Direct Rendering Infrastructure): This directory contains the DRM (Direct Rendering Manager) device nodes, specifically renderD128 (with renderD129 and up on multi-GPU hosts). ROCm opens these render nodes to create per-GPU contexts and submit work to each card.

The "Cannot access /dev/kfd" error is a direct symptom of Docker blocking the container's access to these specific character devices, or a mismatch in group permissions between the host and the container.
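Before reaching for Docker flags, it helps to confirm the nodes actually exist on the host. A minimal pre-flight sketch (the function name and default paths are illustrative; multi-GPU hosts expose renderD129 and up):

```shell
# Check that the ROCm device nodes are present before launching a container.
# Pass explicit paths as arguments, or rely on the single-GPU defaults.
check_rocm_devices() {
  local devs=("$@")
  [ "${#devs[@]}" -eq 0 ] && devs=(/dev/kfd /dev/dri/renderD128)
  local missing=0
  for dev in "${devs[@]}"; do
    if [ ! -e "$dev" ]; then
      echo "missing: $dev"
      missing=1
    fi
  done
  return "$missing"
}
# Usage: check_rocm_devices || echo "host driver not ready"
```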

The Fix: Correctly Configuring the AMD GPU Container

A proper ROCm Docker setup requires explicit device mapping and group permission additions. You must pass both the devices and the required user groups to the container runtime.

Solution 1: Docker CLI

If you are running standalone containers, append the following flags to your docker run command:

docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --group-add=render \
  --security-opt seccomp=unconfined \
  rocm/pytorch:latest
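If your CI scripts launch many one-off containers, the flags can be centralized so every invocation stays consistent. A small sketch (the function name is illustrative, not part of Docker):

```shell
# Emit the device and permission flags ROCm containers need, for reuse
# across docker run invocations in pipeline scripts.
rocm_docker_flags() {
  printf '%s ' \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add=video \
    --group-add=render \
    --security-opt seccomp=unconfined
}
# Usage: docker run -it $(rocm_docker_flags) rocm/pytorch:latest
```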

Solution 2: Docker Compose (Recommended for MLOps)

For structured deployments, define the device mappings in your docker-compose.yml file. This ensures your AMD GPU container environment is reproducible.

services:
  mlops-worker:
    image: rocm/pytorch:latest
    container_name: rocm-compute-node
    # Map the required AMD kernel devices
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    # Append group permissions required for device access
    group_add:
      - video
      - render
    # Disable default seccomp profile for ROCm profiler compatibility
    security_opt:
      - seccomp:unconfined
    # Shared memory must be increased for PyTorch DataLoader workers
    shm_size: '8gb'
    command: ["python3", "-c", "import torch; print(f'ROCm available: {torch.cuda.is_available()}')"]
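On multi-GPU hosts you can pin a worker to one card by mapping its render node instead of the whole /dev/dri directory, which is the approach AMD documents for restricting GPU visibility. A sketch, assuming the second GPU is exposed as renderD129 (confirm with ls /dev/dri on your host):

```yaml
services:
  mlops-worker-gpu1:
    image: rocm/pytorch:latest
    devices:
      - "/dev/kfd:/dev/kfd"                        # compute dispatch interface is shared
      - "/dev/dri/renderD129:/dev/dri/renderD129"  # expose the second GPU only
    group_add:
      - video
      - render
```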

Deep Dive: Breaking Down the Configuration

Applying the fix resolves the error, but understanding why each flag is necessary is critical for production environments.

Device Mapping (--device)

Mapping /dev/kfd and /dev/dri instructs Docker to update the cgroup device whitelist, allowing the container processes to read and write to the specified hardware nodes. Unlike Nvidia's proprietary runtime wrapper (--gpus all), ROCm utilizes standard Linux kernel device nodes natively.

Group Permissions (--group-add)

Linux distributions handle DRM node ownership differently. Ubuntu typically assigns /dev/kfd to the render group, while older distributions or custom kernel builds might assign it to the video group.

Inside the container, the application process runs as a specific user (often root, but ideally a non-root application user). Even if the device is mapped, Linux permissions will block access if the container user does not belong to the correct host-equivalent group. Passing --group-add=video and --group-add=render ensures the container user inherits the necessary GIDs (Group IDs) to perform read/write operations on the hardware.
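To see which mapping applies on your host, inspect the device nodes directly. A sketch (the helper names are illustrative; stat here is GNU coreutils):

```shell
# Report the owning group of a device node, by name and by numeric GID, so it
# can be matched against the groups added to the container.
device_group() { stat -c '%G' "$1"; }
device_gid()   { stat -c '%g' "$1"; }
# On a ROCm host:
#   device_group /dev/kfd                      # often "render" on Ubuntu
#   device_gid /dev/dri/renderD128
#   getent group "$(device_group /dev/kfd)"    # list members of that group
```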

Syscall Whitelisting (seccomp=unconfined)

Docker’s default seccomp profile blocks certain system calls to reduce the attack surface. However, ROCm’s lower-level tooling—specifically the ROCm Profiler (rocprof) and advanced memory allocation routines—frequently use perf_event_open and process_vm_readv. Setting seccomp=unconfined allows these system calls, preventing silent performance degradations or profiling failures.
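If running fully unconfined is too permissive for your environment, a middle ground is a custom profile that starts from a copy of Docker's default and explicitly allows the syscalls the ROCm tooling needs. A sketch of the relevant fragment (merge this entry into the default profile's syscalls array rather than using it alone):

```json
{
  "syscalls": [
    {
      "names": ["perf_event_open", "process_vm_readv"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Pass the merged profile with --security-opt seccomp=/path/to/rocm-profile.json instead of seccomp=unconfined.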

Edge Cases and Host-Level Troubleshooting

If you have applied the Docker configurations and still encounter issues, the problem resides at the host operating system level.

1. Missing Host Permissions

On the host, the user working with the GPU (and, in rootless Docker setups, the user running the daemon itself) must be a member of the video and render groups. Verify and update host permissions:

# Check current groups
groups $USER

# Add user to required groups if missing
sudo usermod -aG video,render $USER

# Group changes take effect only in new sessions: log out and back in,
# or start a subshell with the group active (each newgrp spawns a nested shell)
newgrp video
newgrp render

2. The Kernel Driver is Not Loaded

If /dev/kfd does not exist on the host, mapping it to Docker will fail. Verify the amdgpu kernel module is loaded:

lsmod | grep amdgpu

If the output is empty, the AMD kernel drivers (typically provided by the amdgpu-dkms package) are missing or failed to compile against your current kernel version. Review your host's ROCm installation.
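For provisioning playbooks, the lsmod check can be scripted against /proc/modules, which is the file lsmod itself parses. A sketch:

```shell
# Return success if the named kernel module is currently loaded.
module_loaded() {
  grep -q "^$1 " /proc/modules 2>/dev/null
}
# On the host (requires the amdgpu-dkms package to be installed):
#   module_loaded amdgpu || sudo modprobe amdgpu
```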

3. IOMMU and PCIe Atomicity Failures

Data center AMD GPUs (Instinct series) and modern consumer cards (RDNA2/RDNA3) require PCIe atomics. If the host BIOS or GRUB configuration restricts IOMMU (Input-Output Memory Management Unit), the KFD driver will fail to initialize the compute nodes.

Check the kernel ring buffer for KFD errors:

dmesg | grep -i kfd

If you see errors related to PCIe atomics or IOMMU, append iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, update GRUB (sudo update-grub on Debian/Ubuntu), and reboot the host. Passthrough mode keeps the IOMMU enabled for virtualization but skips DMA remapping for host devices, which sidesteps the translation issues that can prevent the KFD driver from initializing compute nodes.
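When rolling the GRUB change out across a fleet, make the edit idempotent so repeated runs don't duplicate the parameter. A sketch of a pure helper (the name is illustrative); apply its output to GRUB_CMDLINE_LINUX_DEFAULT, regenerate the GRUB config, and reboot:

```shell
# Append iommu=pt to a kernel command line string only if it is not
# already present, so the operation is safe to repeat.
add_iommu_pt() {
  local cmdline="$1"
  case " $cmdline " in
    *" iommu=pt "*) printf '%s' "$cmdline" ;;
    *)              printf '%s' "${cmdline:+$cmdline }iommu=pt" ;;
  esac
}
# Usage: add_iommu_pt "quiet splash"   ->   quiet splash iommu=pt
```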

Conclusion

Containerizing AI workloads on AMD hardware relies entirely on native Linux device management. By explicitly passing /dev/kfd and /dev/dri, mapping the video and render groups, and managing syscall profiles, you eliminate hardware access errors. This standardizes your ROCm Docker setup, allowing seamless scaling for CI/CD pipelines and production machine learning deployments.