Solving 'OSError: We couldn't connect to huggingface.co' in Offline Mode

Nothing stops a production deployment faster than an unexpected network call in an environment designed to be isolated. You have carefully containerized your machine learning inference service, verified the model files are inside the Docker image, and deployed it to an air-gapped Kubernetes cluster.

Yet, upon startup, the application crashes with the dreaded error:

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like <model_id> is not the path to a directory containing a file named config.json.

This error is misleading. Often, the files are there, but the library is prioritizing a network handshake over the local filesystem. This guide covers the root cause of this behavior in the Hugging Face transformers library and provides the production-grade configuration to enforce offline execution.

Root Cause Analysis: Why from_pretrained Pings the Internet

To solve this permanently, you must understand the resolution logic inside the huggingface_hub library (which transformers relies on).

When you call AutoModel.from_pretrained("bert-base-uncased"), the library performs the following decision tree:

  1. Identifier Resolution: It checks if the string passed is a local path. If a folder named bert-base-uncased does not exist locally, it assumes the string is a Model ID on the Hub.
  2. ETag Verification: By default, the library attempts to send a HEAD request to Hugging Face servers. It does this to retrieve the commit hash (ETag) of the model files to ensure your local cache is up to date.
  3. Fallback vs. Failure:
    • Soft Failure: If the server is reachable but returns an error response, the library can fall back to the cached files.
    • Hard Failure: If the DNS resolution fails or the connection is actively refused by a corporate firewall (common in Enterprise VPCs), the request raises a ConnectionError or OSError before the library attempts to look in the cache.

The error occurs because the library defaults to consistency (fetching the latest weights) over availability (using what you already have), unless explicitly told otherwise.
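The resolution order above can be sketched in plain Python. This is a simplified illustration of the logic, not the actual huggingface_hub implementation:

```python
import os

def resolve(name_or_path: str, cache_hit: bool = False) -> str:
    """Simplified sketch of the from_pretrained identifier resolution order."""
    # 1. A local directory short-circuits everything else.
    if os.path.isdir(name_or_path):
        return "load_from_local_directory"
    # 2. Offline mode skips the ETag check and goes straight to the cache.
    if os.environ.get("HF_HUB_OFFLINE") == "1":
        return "load_from_cache" if cache_hit else "raise OSError"
    # 3. Default: a HEAD request to the Hub to verify the cached revision.
    return "head_request_to_hub"
```

Note that step 3 is reached whenever the identifier is not a local directory and offline mode is off, which is exactly why an air-gapped deployment fails at startup.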

The Fix: Forcing Offline Mode via Environment Variables

The most robust solution for containerized environments (Docker/Kubernetes) is using environment variables. This requires no code changes to your Python application and ensures the behavior is consistent across infrastructure.

Hugging Face provides a dedicated environment variable to toggle this behavior: HF_HUB_OFFLINE.

1. Dockerfile Implementation

In your Dockerfile, set this variable to 1. This instructs the library to skip the ETag check entirely and only look at local files.

# Dockerfile
FROM python:3.11-slim

# Prevent Python from writing pyc files to disc
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# CRITICAL: Tell Hugging Face to run in offline mode
ENV HF_HUB_OFFLINE=1

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy your application code
COPY . .

# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

2. Kubernetes Deployment Manifest

If you cannot modify the Docker image, you can inject this variable at runtime in your Kubernetes deployment.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-service
spec:
  template:
    spec:
      containers:
        - name: inference-container
          image: my-registry/inference-service:v1
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            # Optional: If you are using Datasets library as well
            - name: HF_DATASETS_OFFLINE
              value: "1"

The Fix: Enforcing Offline Mode in Python Code

If you prefer to handle this at the application level—for example, to allow a "try online, fall back to offline" strategy—you can use the local_files_only parameter.

This approach is useful during development but is generally less strict than the environment variable approach for production security compliance.

import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def load_model_offline(model_path: str):
    """
    Loads a model strictly from local files.
    """
    try:
        # local_files_only=True prevents the network call
        tokenizer = AutoTokenizer.from_pretrained(
            model_path, 
            local_files_only=True
        )
        
        model = AutoModelForSequenceClassification.from_pretrained(
            model_path, 
            local_files_only=True
        )
        
        return tokenizer, model
        
    except OSError:
        print(f"CRITICAL: Could not find model at {model_path}.")
        print("Ensure the directory contains config.json and the weight files "
              "(model.safetensors or pytorch_model.bin).")
        raise

# Usage
# Note: 'model_path' should be a directory, not a repo ID like 'bert-base-uncased'
path_to_model = "./models/bert-finetuned"
tokenizer, model = load_model_offline(path_to_model)
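The "try online, fall back to offline" strategy mentioned above can be expressed as a small wrapper. Note that load_with_fallback is an illustrative helper written for this article, not part of the transformers API:

```python
def load_with_fallback(load_fn, model_id: str, **kwargs):
    """Try the Hub first; on a network-related OSError, retry local files only."""
    try:
        return load_fn(model_id, **kwargs)
    except OSError:
        # Network unreachable or Hub down: retry against the local cache/disk.
        return load_fn(model_id, local_files_only=True, **kwargs)

# Usage (load_fn would be AutoTokenizer.from_pretrained or similar):
# tokenizer = load_with_fallback(AutoTokenizer.from_pretrained, "bert-base-uncased")
```

Because the wrapper takes the loader as a parameter, the same fallback logic works for tokenizers, models, and configs alike.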

Deep Dive: Correctly Baking Models into Docker Images

Setting HF_HUB_OFFLINE=1 is only half the battle. The other half is ensuring the files actually exist in the image.

A common mistake is assuming the standard Hugging Face cache (~/.cache/huggingface) is portable. It is not. The cache uses symlinks and hash-based filenames that are difficult to manage manually.

The "Save Pretrained" Pattern

Instead of copying the cache folder, you should explicitly download and save the model to a dedicated directory during your build process (or in a CI/CD pipeline).

Create a script named download_model.py:

import os
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
OUTPUT_DIR = "./local_model"

def download_and_save():
    print(f"Downloading {MODEL_NAME}...")
    
    # Download weights and config
    model = AutoModel.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    # Save to a clean, non-linked directory structure
    print(f"Saving to {OUTPUT_DIR}...")
    model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    
    print("Download complete.")

if __name__ == "__main__":
    download_and_save()

Multi-Stage Docker Build for Model Baking

Use a multi-stage Docker build to keep your final image size optimized. Download the model in a build stage, then copy the clean directory to the runner stage.

# Stage 1: Builder
FROM python:3.11-slim AS builder

WORKDIR /build
RUN pip install transformers torch

COPY download_model.py .
# This step requires internet access during the build process
RUN python download_model.py

# Stage 2: Runtime
FROM python:3.11-slim

WORKDIR /app

# Install runtime deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model from the builder stage
COPY --from=builder /build/local_model /app/model

# Set offline mode
ENV HF_HUB_OFFLINE=1
ENV MODEL_PATH="/app/model"

COPY main.py .

CMD ["python", "main.py"]

Common Pitfalls and Edge Cases

1. The Tokenizer Mismatch

Developers often download the model weights but forget the tokenizer files (tokenizer.json, vocab.txt). If AutoTokenizer.from_pretrained cannot find local files, it will trigger the network call even if the model weights are present. Always call save_pretrained on both the model and the tokenizer.
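A quick pre-flight check catches this pitfall before the app starts. The file lists below are a common baseline and an assumption on my part; the exact tokenizer files vary by model family:

```python
import os

def verify_model_dir(path: str) -> list[str]:
    """Return a list of expected artifacts missing from a saved model directory."""
    present = set(os.listdir(path)) if os.path.isdir(path) else set()
    missing = [f for f in ["config.json"] if f not in present]
    # At least one weights file and one tokenizer file should be present.
    if not any(w in present for w in ("model.safetensors", "pytorch_model.bin")):
        missing.append("weights (model.safetensors or pytorch_model.bin)")
    if not any(t in present for t in ("tokenizer.json", "vocab.txt", "spiece.model")):
        missing.append("tokenizer files (tokenizer.json / vocab.txt)")
    return missing
```

Running this check in a container health probe turns a cryptic OSError at load time into an actionable list of missing files.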

2. Dynamically Loading Code (trust_remote_code=True)

Some architectures (like Falcon or certain Llama variants) utilize custom Python code hosted on the Hub. If your model requires trust_remote_code=True, save_pretrained will save the weights, but it might not pull down the remote Python files correctly in all versions of transformers.

Verify your local directory contains the .py modeling files (e.g., modeling_falcon.py) if you are using a custom architecture.

3. Absolute vs. Relative Paths

When running inside Docker, relative paths can be ambiguous depending on the WORKDIR. Always use absolute paths or verify your WORKDIR. If you set MODEL_PATH="./model" but your app runs from /app/src, it will look in /app/src/model, not /app/model.
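A defensive pattern is to resolve the model path to an absolute path at startup and fail fast if the directory does not exist. MODEL_PATH here mirrors the variable set in the Dockerfile earlier; the helper itself is a sketch:

```python
import os
from pathlib import Path

def resolve_model_path(default: str = "/app/model") -> Path:
    """Resolve MODEL_PATH to an absolute path, failing fast if it is missing."""
    path = Path(os.environ.get("MODEL_PATH", default)).resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    return path
```

Failing at import time with an explicit path in the message is far easier to debug than an OSError raised deep inside from_pretrained.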

Conclusion

The OSError connection failure in offline environments is a feature, not a bug, of the Hugging Face hub resolution logic. It prioritizes model freshness.

To stabilize your enterprise deployments, you must invert this priority. Use HF_HUB_OFFLINE=1 to disable the network check globally, and utilize the save_pretrained pattern to create a portable, self-contained directory of model artifacts that can be safely baked into your container images. This ensures your ML services remain resilient, regardless of firewall rules or internet connectivity.