Fixing 'Ollama Not Using GPU' in Docker: A Guide for WSL2 and Linux

There are few things more frustrating in AI engineering than watching a powerful Llama 3 or Mistral model crawl at 0.5 tokens per second. You have an RTX 3090 or a hefty server GPU, yet your Dockerized Ollama instance insists on burning up your CPU cores instead.

If you are running Ollama inside a Docker container and it fails to detect your NVIDIA GPU, the issue is rarely with Ollama itself. The problem lies in the isolation layer between the Docker daemon and the host kernel’s graphics drivers.

This guide provides the architectural root cause and the specific, copy-paste configurations required to force GPU passthrough on both native Linux and WSL2 environments.

The Root Cause: Why Docker Isolates Your GPU

To fix the issue, you must understand the "gap" in the architecture. Docker containers share the host's OS kernel but maintain their own user space (filesystem, libraries, and binaries).

By default, a container acts as a clean slate. It does not have access to the host's PCI devices, nor does it have the proprietary NVIDIA driver libraries (libnvidia-container.so, libcuda.so) mounted into its filesystem.

When Ollama initializes, it queries the system for accessible accelerators (specifically looking for CUDA endpoints). In a standard docker run environment without specific flags and runtime hooks, these endpoints simply do not exist. Ollama catches the exception and silently falls back to CPU inference.

To bridge this gap, we rely on the NVIDIA Container Toolkit. This acts as a wrapper around runc (the default container runtime). It utilizes a prestart hook to mount the necessary driver files and device nodes from the host into the container at runtime.
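Once the toolkit is installed (Step 1 below), you can check whether the daemon actually registered the NVIDIA runtime. A minimal sketch, assuming a POSIX shell; it degrades gracefully when docker is not on the PATH:

```shell
# Sketch: check whether the Docker daemon has the "nvidia" runtime registered.
if command -v docker >/dev/null 2>&1; then
  runtimes=$(docker info --format '{{json .Runtimes}}' 2>/dev/null)
  case "$runtimes" in
    *nvidia*) status="nvidia runtime registered" ;;
    *)        status="nvidia runtime NOT registered" ;;
  esac
else
  status="docker not found on PATH"
fi
echo "$status"
```

If this reports the runtime as missing, the prestart hook will never fire and every container will silently fall back to CPU.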

Prerequisites

Before configuring Docker, ensure your host environment is prepared.

For Native Linux (Ubuntu/Debian/CentOS)

You must have the proprietary NVIDIA drivers installed on the host OS.

# Check if drivers are active on the host
nvidia-smi

If this command fails, install the drivers via your package manager immediately. Docker cannot pass through what the host cannot see.
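A defensive sketch of that host check follows. The install hint is Ubuntu-specific and only one possible route; package names differ on Debian/CentOS and by GPU generation:

```shell
# Guarded driver check for the host.
if command -v nvidia-smi >/dev/null 2>&1; then
  status="driver present: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n1)"
else
  status="driver missing: try 'sudo ubuntu-drivers autoinstall' (Ubuntu), then reboot"
fi
echo "$status"
```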

For WSL2 Users

Do not install NVIDIA drivers inside the WSL2 Linux distribution. WSL2 uses a unique architecture where the Linux kernel proxies calls to the Windows host. You only need the NVIDIA drivers installed on your Windows host system. The driver is projected into WSL2 automatically at /usr/lib/wsl/lib.
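A quick way to confirm the projection worked, run from inside the WSL2 distro (the path check degrades gracefully on other systems):

```shell
# Confirm the Windows driver was projected into the distro.
if [ -d /usr/lib/wsl/lib ]; then
  status="projected driver found: $(ls /usr/lib/wsl/lib | tr '\n' ' ')"
else
  status="/usr/lib/wsl/lib not found (not WSL2, or the Windows driver is missing)"
fi
echo "$status"
```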

Step 1: Installing the NVIDIA Container Toolkit

This is the bridge that allows the container to request GPU access.

Native Linux Installation

If you are on Ubuntu or Debian, run the following commands to configure the repository and install the toolkit.

# 1. Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# 2. Update and Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 3. Configure the Docker daemon to use the toolkit
sudo nvidia-ctk runtime configure --runtime=docker

# 4. Restart Docker to apply changes
sudo systemctl restart docker

WSL2 Installation

If you are using Docker Desktop for Windows, this is pre-configured. Ensure "Use the WSL 2 based engine" is checked in settings.

If you are running the Docker engine natively inside a WSL2 distro (bypassing Docker Desktop), follow the Native Linux Installation steps above exactly. WSL2 is binary compatible with standard Ubuntu/Debian repositories.

Step 2: Running Ollama with GPU Flags

The most common failure point is the docker run command itself. You must explicitly request the GPU resource. The --gpus all flag instructs the Docker daemon to utilize the NVIDIA runtime hooks we just installed.

The Correct Docker Run Command

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If you need a specific GPU (for multi-GPU setups), replace --gpus=all with --gpus '"device=0"'.
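Before involving Ollama at all, it helps to smoke-test passthrough with a throwaway container. This sketch assumes the stock ubuntu image; no CUDA image is needed because the runtime hook injects nvidia-smi from the host at container start:

```shell
# Smoke test: if this prints the GPU table, passthrough works
# independently of Ollama.
if command -v docker >/dev/null 2>&1; then
  result=$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1) \
    || result="passthrough failed: $result"
else
  result="docker not found on PATH"
fi
echo "$result"
```

If the smoke test fails here, no amount of Ollama configuration will help; go back to Step 1.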

Step 3: Deployment via Docker Compose

In a production DevOps environment, you should be using docker-compose.yml rather than imperative shell commands.

The syntax for GPU reservation changed in recent Compose file specifications. Do not use the legacy runtime: nvidia syntax. Use the device reservation under the deploy.resources block instead.

Here is a valid, modern compose.yaml for Ollama:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_storage:/root/.ollama
    # The critical GPU configuration
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # Or set to 1, 2, etc.
              capabilities: [gpu]
    restart: always

volumes:
  ollama_storage:

Run this with:

docker compose up -d
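It is worth checking that Compose actually parsed the reservation, since an indentation slip will silently drop it. docker compose config prints the fully rendered configuration; run this guarded sketch from the directory containing the file above:

```shell
# Counts how many "capabilities" entries survived into the rendered config.
if command -v docker >/dev/null 2>&1; then
  count=$(docker compose config 2>/dev/null | grep -c 'capabilities')
  status="capabilities entries in rendered config: $count"
else
  status="docker not found on PATH"
fi
echo "$status"
```

A count of zero means the deploy block was ignored or mis-indented.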

Deep Dive: Verifying the Fix

Do not assume it is working just because the container started. We need to verify that Ollama has acquired the CUDA handle.

Method 1: The NVIDIA-SMI Check

Execute nvidia-smi inside the container. This confirms the runtime hook successfully mounted the drivers.

docker exec -it ollama nvidia-smi

Success Indicator: You see a formatted table listing your GPU name, VRAM usage, and driver version.
Failure Indicator: "command not found" or "no devices found."

Method 2: Inspecting Ollama Logs

Ollama logs the device selection process during model initialization. Watch the logs while triggering an inference request.

  1. Keep a log tail open:
    docker logs -f ollama
    
  2. In a separate terminal, trigger a model (e.g., Llama 3):
    curl -X POST http://localhost:11434/api/generate -d '{
      "model": "llama3",
      "prompt": "Why is the sky blue?"
    }'
    
  3. Look at the logs. You want to see lines referencing VRAM or Compute Capability.
    • Good: source=gpu buffer_size=... compute_capability=...
    • Bad: driver not found, falling back to cpu, or BLAS references without CUDA.
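The exact log wording varies between Ollama versions, so treat the strings below as illustrative assumptions based on the indicators above, not guaranteed output. A small helper that classifies a single log line:

```shell
# Classify an Ollama log line as GPU- or CPU-backed inference.
# The matched strings are assumptions; exact wording varies by version.
classify_log() {
  case "$1" in
    *compute_capability*|*"source=gpu"*|*CUDA*)   echo "gpu" ;;
    *"falling back to cpu"*|*"driver not found"*) echo "cpu-fallback" ;;
    *)                                            echo "unknown" ;;
  esac
}
classify_log 'msg="offloading layers" source=gpu compute_capability=8.6'  # -> gpu
classify_log 'msg="driver not found, falling back to cpu"'                # -> cpu-fallback
```

You could pipe docker logs ollama through a loop calling this function to triage a long log quickly.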

Common Pitfalls and Edge Cases

1. The "Could not select device driver" Error

If you see Error response from daemon: could not select device driver "" with capabilities: [[gpu]], your Docker daemon is not configured to look for the NVIDIA runtime.

Fix: You skipped the sudo nvidia-ctk runtime configure --runtime=docker step or forgot to restart the Docker daemon. Check /etc/docker/daemon.json. It should contain:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
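A guarded check for that file, with the remediation hint baked in:

```shell
# Verify the nvidia runtime entry landed in the daemon config.
cfg=/etc/docker/daemon.json
if [ -f "$cfg" ] && grep -q '"nvidia"' "$cfg"; then
  status="nvidia runtime present in $cfg"
else
  status="missing: re-run 'sudo nvidia-ctk runtime configure --runtime=docker', then restart docker"
fi
echo "$status"
```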

2. WSL2 Memory Starvation

In WSL2, the Linux VM only gets 50% of your total system RAM by default. If you load a massive model (like Mixtral 8x7b) and it overflows VRAM, it will attempt to offload layers to system RAM. If that is also full, the process gets killed.

Fix: Create a .wslconfig file in your Windows User directory (C:\Users\YourUser\.wslconfig):

[wsl2]
memory=32GB  # Adjust based on your total RAM
swap=8GB

Restart WSL via PowerShell: wsl --shutdown.
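To confirm the new cap took effect inside the restarted distro, read /proc/meminfo directly; this works on any Linux kernel, not just WSL2:

```shell
# /proc/meminfo reports what the VM actually received,
# so no WSL-specific tooling is needed.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "RAM visible to this kernel: $((mem_kb / 1024 / 1024)) GiB"
```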

3. Permission Denied on Device Nodes

In rare hardened Linux environments (SELinux/AppArmor), the container might lack permissions to access /dev/nvidia0.

Fix: Ensure the user running the Docker daemon is part of the video or render groups, though the --gpus all flag usually handles cgroup permissions automatically.
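A quick sketch to inspect the device nodes and group membership involved; /dev/nvidia* is the standard location for the proprietary driver's nodes:

```shell
# Inspect the device nodes the container runtime needs to bind,
# and the groups of the current user.
nodes=$(ls /dev/nvidia* 2>/dev/null || echo "none")
echo "nvidia device nodes: $nodes"
echo "current user groups: $(id -nG)"
```

If the nodes exist but the container still cannot open them, check your SELinux/AppArmor audit logs for denials on those paths.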

Conclusion

Running Ollama on CPU when a GPU is available is a waste of resources and developer time. By properly installing the NVIDIA Container Toolkit and utilizing the deploy.resources block in Docker Compose, you ensure that the hardware abstraction layer remains transparent.

Your tokens per second should now reflect the true capability of your hardware. If you are seeing speeds jump from 2 t/s to 50+ t/s, you have successfully bridged the container isolation gap.