There are few things more frustrating in AI engineering than watching a powerful Llama 3 or Mistral model crawl at 0.5 tokens per second. You have an RTX 3090 or a hefty server GPU, yet your Dockerized Ollama instance insists on burning up your CPU cores instead.
If you are running Ollama inside a Docker container and it fails to detect your NVIDIA GPU, the issue is rarely with Ollama itself. The problem lies in the isolation layer between the Docker daemon and the host kernel’s graphics drivers.
This guide provides the architectural root cause and the specific, copy-paste configurations required to force GPU passthrough on both native Linux and WSL2 environments.
The Root Cause: Why Docker Isolates Your GPU
To fix the issue, you must understand the "gap" in the architecture. Docker containers share the host's OS kernel but maintain their own user space (filesystem, libraries, and binaries).
By default, a container acts as a clean slate. It does not have access to the host's PCI devices, nor does it have the proprietary NVIDIA driver libraries (libcuda.so, libnvidia-ml.so) mapped into its filesystem.
When Ollama initializes, it queries the system for accessible accelerators (specifically looking for CUDA endpoints). In a standard docker run environment without specific flags and runtime hooks, these endpoints simply do not exist. Ollama detects no usable device and silently falls back to CPU inference.
To bridge this gap, we rely on the NVIDIA Container Toolkit. This acts as a wrapper around runc (the default container runtime). It utilizes a prestart hook to mount the necessary driver files and device nodes from the host into the container at runtime.
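You can observe both the gap and the bridge directly. A minimal sketch, assuming the toolkit from Step 1 below is already installed and a generic ubuntu image is available:

```shell
# Without GPU flags: the container sees no NVIDIA device nodes at all.
docker run --rm ubuntu sh -c 'ls /dev | grep nvidia || echo "no NVIDIA devices"'

# With --gpus all: the prestart hook injects device nodes and driver libraries.
docker run --rm --gpus all ubuntu sh -c 'ls /dev | grep nvidia'
# Expect entries such as nvidia0, nvidiactl, nvidia-uvm
```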
Prerequisites
Before configuring Docker, ensure your host environment is prepared.
For Native Linux (Ubuntu/Debian/CentOS)
You must have the proprietary NVIDIA drivers installed on the host OS.
# Check if drivers are active on the host
nvidia-smi
If this command fails, install the drivers via your package manager immediately. Docker cannot pass through what the host cannot see.
For WSL2 Users
Do not install NVIDIA drivers inside the WSL2 Linux distribution. WSL2 uses a unique architecture where the Linux kernel proxies calls to the Windows host. You only need the NVIDIA drivers installed on your Windows host system. The driver is projected into WSL2 automatically at /usr/lib/wsl/lib.
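A quick sanity check from inside the WSL2 distro confirms the projection (paths are the standard WSL locations):

```shell
# The Windows host driver is projected into WSL2 here:
ls /usr/lib/wsl/lib
# Expect libcuda.so, libnvidia-ml.so, nvidia-smi, ...

# nvidia-smi inside WSL2 proxies calls to the Windows driver
nvidia-smi
```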
Step 1: Installing the NVIDIA Container Toolkit
This is the bridge that allows the container to request GPU access.
Native Linux Installation
If you are on Ubuntu or Debian, run the following commands to configure the repository and install the toolkit.
# 1. Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Update and Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 3. Configure the Docker daemon to use the toolkit
sudo nvidia-ctk runtime configure --runtime=docker
# 4. Restart Docker to apply changes
sudo systemctl restart docker
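Before involving Ollama at all, confirm the toolkit works with a throwaway container. The CUDA image tag here is only an example; any recent nvidia/cuda base tag will do:

```shell
# If this prints the familiar nvidia-smi table, the runtime hook is working.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this fails, fix it here before debugging Ollama; the problem is in Docker's GPU plumbing, not in the application.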
WSL2 Installation
If you are using Docker Desktop for Windows, this is pre-configured. Ensure "Use the WSL 2 based engine" is checked in settings.
If you are running the Docker Engine natively inside a WSL2 distro (bypassing Docker Desktop), follow the Native Linux Installation steps above exactly; a WSL2 Ubuntu or Debian distro uses the same package repositories as its bare-metal counterpart.
Step 2: Running Ollama with GPU Flags
The most common failure point is the docker run command itself. You must explicitly request the GPU resource. The --gpus all flag instructs the Docker daemon to utilize the NVIDIA runtime hooks we just installed.
The Correct Docker Run Command
docker run -d \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
If you need a specific GPU (for multi-gpu setups), replace --gpus=all with --gpus '"device=0"'.
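To find the index (or the UUID, which is stable across reboots) to pass to --gpus, list the GPUs on the host:

```shell
# List GPUs with their indices and UUIDs
nvidia-smi -L
# e.g. GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-...)
```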
Step 3: Deployment via Docker Compose
In a production DevOps environment, you should be using docker-compose.yml rather than imperative shell commands.
The syntax for GPU reservation changed in recent Compose file specifications. Do not use the legacy runtime: nvidia syntax. Use the deploy block resources reservation.
Here is a valid, modern compose.yaml for Ollama:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_storage:/root/.ollama
    # The critical GPU configuration
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # Or set to 1, 2, etc.
              capabilities: [gpu]
    restart: always

volumes:
  ollama_storage:
Run this with:
docker compose up -d
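You can confirm the reservation actually reached the container spec; Docker records --gpus-style requests in the HostConfig.DeviceRequests field:

```shell
# Should print a JSON array containing "Driver":"nvidia" and the gpu capability
docker inspect ollama --format '{{json .HostConfig.DeviceRequests}}'
```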
Deep Dive: Verifying the Fix
Do not assume it is working just because the container started. We need to verify that Ollama has acquired the CUDA handle.
Method 1: The NVIDIA-SMI Check
Execute nvidia-smi inside the container. This confirms the runtime hook successfully mounted the drivers.
docker exec -it ollama nvidia-smi
Success Indicator: you see a formatted table listing your GPU name, VRAM usage, and driver version.
Failure Indicator: "command not found" or "no devices found."
Method 2: Inspecting Ollama Logs
Ollama logs the device selection process during model initialization. Watch the logs while triggering an inference request.
- Keep a log tail open:
docker logs -f ollama
- In a separate terminal, trigger a model (e.g., Llama 3):
curl -X POST http://localhost:11434/api/generate -d '{ "model": "llama3", "prompt": "Why is the sky blue?" }'
- Look at the logs. You want to see lines referencing VRAM or Compute Capability.
- Good: source=gpu, buffer_size=..., compute_capability=...
- Bad: driver not found, falling back to cpu, or BLAS references without CUDA.
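The log scan can be collapsed into a single grep. The pattern is a loose heuristic, not an exhaustive list of Ollama's log keys:

```shell
# Surface GPU-related lines from the Ollama container logs
docker logs ollama 2>&1 | grep -iE 'cuda|vram|compute capab|gpu'
```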
Common Pitfalls and Edge Cases
1. The "Could not select device driver" Error
If you see Error response from daemon: could not select device driver "" with capabilities: [[gpu]], your Docker daemon is not configured to look for the NVIDIA runtime.
Fix: You skipped the sudo nvidia-ctk runtime configure --runtime=docker step or forgot to restart the Docker daemon. Check /etc/docker/daemon.json. It should contain:
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
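After restarting the daemon, verify that Docker has registered the runtime (output formatting varies slightly across Docker versions, so treat this as a loose check):

```shell
# Should include "nvidia" among the registered runtimes
docker info --format '{{.Runtimes}}' | grep -o nvidia
```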
2. WSL2 Memory Starvation
In WSL2, the Linux VM only gets 50% of your total system RAM by default. If you load a massive model (like Mixtral 8x7b) and it overflows VRAM, it will attempt to offload layers to system RAM. If that is also full, the process gets killed.
Fix: Create a .wslconfig file in your Windows User directory (C:\Users\YourUser\.wslconfig):
[wsl2]
# Adjust memory based on your total RAM
memory=32GB
swap=8GB
Restart WSL via PowerShell: wsl --shutdown.
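After the restart, confirm from inside the distro that the new limits took effect:

```shell
# "total" should now match the memory= value from .wslconfig
free -h
# The swap allocation shows up here
swapon --show
```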
3. Permission Denied on Device Nodes
In rare hardened Linux environments (SELinux/AppArmor), the container might lack permissions to access /dev/nvidia0.
Fix: Ensure the user running the Docker daemon is part of the video or render groups, though the --gpus all flag usually handles cgroup permissions automatically.
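You can inspect the device nodes on the host directly; on most systems they are world read/write, which is why this failure mode is rare:

```shell
# Typical healthy mode is crw-rw-rw- on /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm
ls -l /dev/nvidia*
# Check the daemon user's group memberships if the mode is restricted
id
```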
Conclusion
Running Ollama on CPU when a GPU is available is a waste of resources and developer time. By properly installing the NVIDIA Container Toolkit and utilizing the deploy.resources block in Docker Compose, you ensure that the hardware abstraction layer remains transparent.
Your tokens per second should now reflect the true capability of your hardware. If you are seeing speeds jump from 2 t/s to 50+ t/s, you have successfully bridged the container isolation gap.