
Increasing Ollama's Default Context Window: Stop the AI from Forgetting

You have orchestrated a complex Retrieval-Augmented Generation (RAG) pipeline. Your vector database accurately fetches the relevant documents, and your Python application cleanly formats them into a comprehensive prompt. Yet, when the LLM generates a response, it hallucinates details or entirely ignores the instructions provided at the beginning of the prompt.

This silent failure is a well-known hurdle for LLM application developers. The root cause is rarely the prompt engineering or the retrieval mechanism. Instead, it is Ollama's strict default context window limit.

The Root Cause: Why Ollama Silently Truncates Memory

Ollama is designed to run seamlessly on consumer hardware, prioritizing high compatibility and avoiding Out-Of-Memory (OOM) crashes. To achieve this, Ollama imposes a hard default context window of 2048 tokens on nearly all models, regardless of the base model's actual theoretical maximum.

When your prompt, system instructions, and RAG context exceed this 2048-token limit, Ollama does not throw an error. Instead, it implements a rolling window technique. It silently evicts the oldest tokens from the context block to make room for new ones.
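The eviction behavior can be pictured with a toy sketch. This is an analogy, not Ollama's actual implementation: a fixed-capacity buffer silently drops its oldest entries when full, just as the oldest tokens fall out of the context.

```python
from collections import deque

# Toy model of a rolling context window with capacity for 8 "tokens".
# Real llama.cpp eviction is more nuanced, but the effect on the
# oldest tokens is the same: silent loss, no error raised.
context = deque(maxlen=8)

tokens = ["SYSTEM:", "answer", "from", "docs", "DOC:", "fact-A",
          "QUESTION:", "what", "is", "fact-A", "?"]
for tok in tokens:
    context.append(tok)  # when full, the leftmost token is evicted

print(list(context))
# -> ['docs', 'DOC:', 'fact-A', 'QUESTION:', 'what', 'is', 'fact-A', '?']
```

Note which tokens disappeared: the system-prompt tokens at the front of the sequence are exactly the ones that vanish first.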

In an Ollama RAG architecture, the system prompt and the retrieved documents are typically positioned at the very beginning of the input string. Consequently, these critical components are the first to be truncated when the limit is breached. To fix Ollama forgetting context, you must explicitly override this default allocation by adjusting the num_ctx parameter, instructing the backend to allocate sufficient VRAM for a larger Key-Value (KV) cache.
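Before sending a request, you can sanity-check whether the assembled prompt is likely to fit. The sketch below uses a rough heuristic of about four characters per English-text token; actual counts depend on the model's tokenizer, so treat this as a gross-overflow detector, not an exact measurement. The function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose.
    # For exact counts, use the model's own tokenizer.
    return max(1, len(text) // 4)

def fits_in_context(system: str, context: str, question: str,
                    num_ctx: int = 2048, reply_budget: int = 512) -> bool:
    """Check whether prompt plus an expected reply fit inside num_ctx."""
    prompt_tokens = sum(estimate_tokens(t) for t in (system, context, question))
    return prompt_tokens + reply_budget <= num_ctx

# A 12,000-character retrieved document (~3,000 tokens) blows past
# the 2048-token default but fits comfortably in 8192.
doc = "x" * 12_000
print(fits_in_context("You are helpful.", doc, "Summarize.", num_ctx=2048))  # False
print(fits_in_context("You are helpful.", doc, "Summarize.", num_ctx=8192))  # True
```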

The Fix: Overriding the Context Window

There are two primary methods to increase the context window: dynamically at runtime via the API, or persistently by creating a custom model configuration.

Method 1: Dynamic Allocation via the Python API

For most RAG pipelines, controlling the context length dynamically within your application code is the preferred approach. This allows you to scale the Ollama API context length based on the specific requirements of the request.

If you are using the official ollama Python client, you must pass the num_ctx parameter inside the options dictionary.

import ollama

def query_large_context(prompt_text: str, context_text: str) -> str:
    # Combine context and prompt (simulating a RAG injection)
    full_prompt = f"Context: {context_text}\n\nTask: {prompt_text}"
    
    response = ollama.chat(
        model='llama3',
        messages=[
            {'role': 'system', 'content': 'You are a technical documentation assistant.'},
            {'role': 'user', 'content': full_prompt}
        ],
        # Explicitly increase the context window to 8192 tokens
        options={
            'num_ctx': 8192 
        }
    )
    
    return response['message']['content']

If your application uses LangChain to orchestrate the RAG pipeline, the configuration is passed directly when instantiating the ChatOllama object. (In recent LangChain versions, ChatOllama lives in the dedicated langchain_ollama package; the langchain_community import below still works but emits a deprecation warning.)

from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

# Initialize the model with an expanded context window
llm = ChatOllama(
    model="llama3",
    num_ctx=8192,
    temperature=0.1
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer strictly based on the provided context: {context}"),
    ("user", "{question}")
])

chain = prompt | llm

# The chain will now respect up to 8192 tokens before truncating
response = chain.invoke({
    "context": "long_retrieved_document_string_here",
    "question": "Summarize the architectural constraints."
})

Method 2: Persistent Allocation via Modelfile

If you are deploying a dedicated model endpoint and want to ensure the context window is always set to a specific size, you should bake the parameter into the model itself. This is done using an Ollama num_ctx Modelfile.

Create a text file named Modelfile (no extension) in your project directory:

# Inherit from your base model of choice
FROM llama3

# Set the context window to 8192 tokens
PARAMETER num_ctx 8192

# Optional: Add standard system prompts for your specific use case
SYSTEM """
You are an expert site reliability engineer. Always analyze the provided logs thoroughly before answering.
"""

Build the custom model via the Ollama CLI:

ollama create llama3-8k -f ./Modelfile

You can now call llama3-8k from your Python application, and it will inherently support the 8192-token context without requiring the options dictionary in the API request.

Deep Dive: How the Context Extension Works

When you increase num_ctx, you are directly modifying the memory allocation strategy of the underlying inference engine (llama.cpp, which Ollama uses under the hood).

Large Language Models utilize a KV cache to store the attention representations of previously processed tokens. This prevents the model from having to recompute attention scores for the entire sequence every time it generates a new token. The size of this KV cache scales linearly with the sequence length.

By raising num_ctx from the default 2048 to 8192, you instruct Ollama to pre-calculate the VRAM required for the expanded KV cache and reserve it on your GPU. If the requested context size requires more VRAM than your GPU has available, Ollama will automatically offload layers to your system's CPU RAM.
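To make the linear scaling concrete, here is a back-of-the-envelope estimate of KV cache size. The defaults below assume Llama 3 8B's published architecture (32 layers, grouped-query attention with 8 KV heads, head dimension 128) and an fp16 cache; Ollama may quantize the cache or choose different precisions, so treat this as an order-of-magnitude sketch.

```python
def kv_cache_bytes(num_ctx: int,
                   n_layers: int = 32,    # Llama 3 8B
                   n_kv_heads: int = 8,   # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:  # fp16
    # Factor of 2 for the separate K and V tensors,
    # stored per layer and per KV head for every cached token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * num_ctx

gib = 1024 ** 3
print(f"{kv_cache_bytes(2048) / gib:.2f} GiB")   # default window: 0.25 GiB
print(f"{kv_cache_bytes(8192) / gib:.2f} GiB")   # 8k window: 1.00 GiB
print(f"{kv_cache_bytes(32768) / gib:.2f} GiB")  # 32k window: 4.00 GiB
```

Under these assumptions, a 32k window alone consumes roughly 4 GiB; stacked on top of several gigabytes of quantized 8B weights, that is why an 8 GB GPU spills over into system RAM.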

Common Pitfalls and Edge Cases

VRAM Exhaustion and Degradation of TPS

Increasing the context window is not a free operation. If your GPU has 8GB of VRAM and you set num_ctx to 32768 (32k tokens), the KV cache will likely exceed your physical VRAM. Ollama will gracefully fall back to system RAM, but your Tokens Per Second (TPS) generation speed will plummet due to the PCIe bottleneck. Calculate your memory budget carefully based on your hardware.

Exceeding the Base Model's Training Limit

Setting num_ctx to 128,000 will not work if the base model was only trained on a maximum of 8,192 tokens (e.g., standard Llama 3 8B). The model utilizes Rotary Position Embedding (RoPE) to understand the sequence of tokens. If you push the context beyond the RoPE scaling limits defined during the model's training phase, the model will output gibberish. Always verify the maximum context length of the specific model weights you are pulling.

The "Lost in the Middle" Phenomenon

Even with a successfully expanded context window, LLMs suffer from a documented degradation in recall accuracy for information placed in the middle of a large prompt. If you are passing 16,000 tokens of RAG context, ensure your retrieval system ranks the most critical documents at the very beginning or the very end of the injected context string.
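One practical mitigation is to reorder retrieved documents so the highest-ranked ones sit at the edges of the context, pushing the least relevant material into the middle. Below is a minimal sketch of that idea (the same strategy as LangChain's LongContextReorder; the function name here is illustrative):

```python
def reorder_for_recall(docs_by_rank: list[str]) -> list[str]:
    """Place the highest-ranked docs at the start and end of the
    context, leaving the lowest-ranked ones in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_rank):
        # Alternate the best documents between the front and the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Ranked best-first: doc1 is the most relevant retrieval hit.
ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]
print(reorder_for_recall(ranked))
# -> ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The top two documents now occupy the first and last positions, where recall is strongest, while the weakest hits land in the middle.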

Conclusion

The silent truncation caused by the default Ollama context window limit is a frequent source of debugging frustration. By explicitly defining the num_ctx parameter—either dynamically via the Python API options or persistently through a custom Modelfile—you ensure your models process the entirety of your retrieval data. Balancing this parameter against your hardware's VRAM capacity and the base model's trained limits is essential for building performant, reliable AI applications.