If you are transitioning to a PyTorch AMD GPU environment for model training or inference, you have likely encountered an immediate roadblock. When attempting to move a tensor to the GPU using .to('cuda') or calling .cuda(), the runtime raises an exception such as "Torch not compiled with CUDA enabled."
This error brings development to a halt. The hardware is physically present, and the system drivers may be installed correctly, but the Python runtime refuses to utilize the GPU.
Resolving this requires replacing the default PyTorch binaries with a build specifically compiled against AMD’s ROCm (Radeon Open Compute) stack.
Understanding the Root Cause
To fix this CUDA error on AMD hardware, you first need to understand how Python package distribution works.
When you run a standard pip install torch, pip reaches out to the default Python Package Index (PyPI). Due to package size limits and NVIDIA's historical dominance in the ecosystem, the official PyTorch binaries uploaded to the primary PyPI registry are compiled exclusively for NVIDIA CUDA or CPU-only environments.
PyTorch relies on AMD's HIP (Heterogeneous-compute Interface for Portability) to translate CUDA calls into ROCm-compatible instructions. This architectural decision is highly beneficial for developers, as it means you do not need to rewrite your .cuda() calls for AMD hardware. However, it requires a specialized binary where this HIP translation layer was compiled into the C++ backend during the build process.
Because the standard PyPI package lacks this HIP/ROCm compilation, any attempt to access the GPU fails, resulting in the runtime exception.
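Before reinstalling anything, you can confirm which backend your current wheel was built against: on a ROCm build, torch.version.hip is a version string, while on CUDA or CPU-only builds it is None. A minimal sketch of that check (the helper name describe_backend is illustrative, not a PyTorch API):

```python
def describe_backend(version_mod):
    """Classify a PyTorch build from the attributes of torch.version.

    ROCm wheels set torch.version.hip; CUDA wheels set torch.version.cuda;
    CPU-only wheels set neither.
    """
    hip = getattr(version_mod, "hip", None)
    cuda = getattr(version_mod, "cuda", None)
    if hip:
        return f"ROCm/HIP build ({hip})"
    if cuda:
        return f"CUDA build ({cuda})"
    return "CPU-only build"

# Usage against a real install:
#   import torch
#   print(describe_backend(torch.version))
```

If this reports a CUDA or CPU-only build, the reinstall steps below apply.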
The Solution: A Proper ROCm PyTorch Install
To fix this, you must instruct pip to bypass the default PyPI registry and fetch the ROCm-compiled wheels directly from PyTorch's dedicated package index.
Step 1: Purge the Existing Installation
Before installing the correct binaries, you must completely remove the existing CPU or NVIDIA-compiled packages. Failing to do so often results in pip utilizing cached, incompatible wheels.
Execute the following command in your terminal or virtual environment:
pip uninstall torch torchvision torchaudio triton -y
pip cache purge
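To confirm the purge actually took effect before moving on, a quick standard-library check can be run from Python (the helper module_absent is illustrative, not part of pip):

```python
import importlib.util

def module_absent(name: str) -> bool:
    """True if `name` cannot be imported in the current environment."""
    return importlib.util.find_spec(name) is None

# After a successful uninstall, module_absent("torch") should be True.
```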
Step 2: Install the ROCm Binaries
Next, initiate the ROCm PyTorch install. At the time of writing, ROCm 6.2 is the recommended stable target for modern AMD machine-learning environments.
Use the --index-url flag to point pip to the ROCm specific repository:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
Note: The download size is substantial (often exceeding 2GB) because it bundles the necessary ROCm runtime libraries. Ensure you have a stable connection.
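Once the install finishes, a quick sanity check is the version string itself: ROCm wheels carry a local version tag such as 2.4.1+rocm6.2, whereas CUDA wheels use a +cuXXX tag. A sketch of that check (is_rocm_build is an illustrative helper name):

```python
def is_rocm_build(version_string: str) -> bool:
    """ROCm wheels append a "+rocm" local version tag, e.g. "2.4.1+rocm6.2"."""
    return "+rocm" in version_string

# Usage against a real install:
#   import torch
#   print(is_rocm_build(torch.__version__))
```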
Step 3: Verify the Installation
Once the installation completes, you must verify that the HIP translation layer is active and successfully communicating with the AMD GPU driver. Run the following Python script to confirm the integration:
import torch

def verify_amd_gpu():
    print(f"PyTorch Version: {torch.__version__}")
    # PyTorch maintains the .cuda() API for AMD ROCm
    rocm_available = torch.cuda.is_available()
    print(f"ROCm Available: {rocm_available}")
    if rocm_available:
        print(f"Device Name: {torch.cuda.get_device_name(0)}")
        print(f"HIP Version: {torch.version.hip}")
        # Test tensor allocation on the AMD GPU
        tensor = torch.randn(3, 3).cuda()
        print(f"Tensor successfully allocated on: {tensor.device}")
    else:
        print("Error: PyTorch cannot detect the ROCm runtime.")

if __name__ == "__main__":
    verify_amd_gpu()
If the output confirms the device name and shows ROCm Available: True, the issue is resolved.
Deep Dive: How PyTorch Handles AMD GPUs
A common point of confusion for engineers new to the AMD machine-learning ecosystem is the continued use of the cuda namespace.
You might wonder why we call torch.cuda.is_available() instead of torch.rocm.is_available(). This is an intentional design pattern by the PyTorch core team. By overloading the cuda namespace, PyTorch ensures that millions of lines of existing open-source ML code, repositories, and libraries run flawlessly on AMD hardware without requiring a single line of code alteration.
Under the hood, the ROCm-compiled PyTorch wheel intercepts these CUDA API calls and maps them to AMD HIP API calls. The tensor is allocated in the VRAM of the Radeon GPU exactly as it would be on an NVIDIA device.
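The translation is largely mechanical because the HIP runtime API deliberately mirrors the CUDA runtime API name-for-name. A few illustrative pairs (the hip* names are real HIP equivalents, but this dict is purely an expository sketch, not how PyTorch implements the mapping):

```python
# Illustrative, non-exhaustive CUDA-runtime-to-HIP name mapping.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def to_hip(cuda_call: str) -> str:
    """Return the HIP counterpart of a CUDA runtime call name."""
    return CUDA_TO_HIP.get(cuda_call, cuda_call)
```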
Common Pitfalls and Edge Cases
Unsupported Consumer RDNA GPUs
Official ROCm support primarily targets AMD Instinct data center accelerators and select high-end workstation cards (e.g., the Radeon Pro series). If you are attempting this on an unsupported consumer Radeon RX 6000 or 7000 series GPU, torch.cuda.is_available() may still return False even after a correct installation.
To fix this, you must override the hardware architecture check by setting the HSA_OVERRIDE_GFX_VERSION environment variable.
For RDNA 3 architectures (Radeon RX 7000 series), spoof the officially supported gfx1100 target:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
python your_training_script.py
For RDNA 2 architectures (Radeon RX 6000 series), spoof the gfx1030 target:
export HSA_OVERRIDE_GFX_VERSION=10.3.0
python your_training_script.py
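If you prefer to set the override from Python rather than the shell, it must land in the environment before torch is imported, because the ROCm runtime reads HSA_OVERRIDE_GFX_VERSION when it initializes. A sketch under that assumption (set_gfx_override is a hypothetical helper name):

```python
import os

def set_gfx_override(version: str) -> None:
    """Set HSA_OVERRIDE_GFX_VERSION, keeping any value already exported."""
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", version)

set_gfx_override("10.3.0")  # RDNA 2 example; use "11.0.0" for RDNA 3
# `import torch` must come AFTER this point for the override to take effect.
```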
Windows Host Limitations
Historically, native ROCm support has been strictly limited to Linux. While AMD has released experimental native ROCm support for Windows, the most stable and performant pathway for PyTorch on a Windows machine remains WSL2 (Windows Subsystem for Linux).
If you are on Windows, ensure you are running your Python environment inside an Ubuntu WSL2 instance, and that your host Windows AMD drivers are fully updated. The WSL2 kernel will pass the GPU through to the Linux environment, allowing the Linux ROCm wheels to interface with the hardware seamlessly.
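To double-check that your Python process is actually running inside WSL2 (rather than native Windows or bare-metal Linux), the kernel identification string is the usual signal: WSL2 kernels contain "microsoft". A small illustrative helper:

```python
def looks_like_wsl(kernel_info: str) -> bool:
    """WSL2 kernels identify themselves with "microsoft" in /proc/version."""
    return "microsoft" in kernel_info.lower()

# Usage on Linux/WSL2:
#   with open("/proc/version") as f:
#       print(looks_like_wsl(f.read()))
```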
Conclusion
Resolving the "PyTorch not compiled with ROCm" error requires bypassing default PyPI behaviors and explicitly installing the ROCm-targeted binaries. By ensuring you have the correct --index-url and applying architecture overrides for consumer-grade RDNA cards, you can unlock full hardware acceleration. The abstraction provided by AMD's HIP layer ensures that once the environment is properly configured, your existing ML workloads will execute without requiring codebase migrations.