How to Run DeepSeek R1 Locally in VS Code for Free (Privacy-First Copilot)

 The era of relying exclusively on paid, cloud-hosted AI coding assistants is ending. While services like GitHub Copilot and Cursor are powerful, they come with two significant downsides: monthly subscription costs and the inherent privacy risk of sending proprietary codebase data to third-party servers.

For Principal Engineers and privacy-conscious developers, the solution lies in Local Inference. By running high-performance open-weight models like DeepSeek R1 on your own hardware, you gain total data sovereignty and eliminate network latency, all without a credit card.

This guide details the exact technical implementation of a local AI stack using Ollama, DeepSeek R1, and VS Code.

The Architecture: Why Local Inference Matters

Before executing the setup, it is vital to understand the architectural shift. Cloud-based assistants operate via REST API calls. Every time you trigger a completion, your IDE packages the current file and cursor context, encrypts it, sends it to an external data center, waits for inference, and receives the tokens back.

Every round trip adds latency and sends your code outside your environment (data egress).

Local inference moves the Large Language Model (LLM) execution to your machine's loopback interface (localhost). We will use Ollama as the inference engine. Ollama acts as a backend service that loads the model weights (GGUF format) into your RAM or VRAM and exposes an OpenAI-compatible API endpoint (by default on port 11434).

Your IDE then treats this local endpoint exactly as it would a cloud API, eliminating network latency entirely.
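To make that round trip concrete, here is a minimal sketch of the kind of request your IDE sends to the local endpoint. It assumes Ollama is running on its default port (11434) and that you have already pulled the model from Step 1 below; the file name is illustrative.

// chat.ts -- minimal sketch of a request against Ollama's OpenAI-compatible endpoint.
// Assumes Ollama is running locally on the default port and deepseek-r1:8b is pulled (Step 1).
async function chat(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "deepseek-r1:8b",
      messages: [{ role: "user", content: prompt }],
      stream: false, // single JSON response instead of a token stream
    }),
  });
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

chat("Explain the difference between RAM and VRAM in one sentence.")
  .then(console.log)
  .catch(console.error);

This is exactly the pattern the Continue extension uses under the hood once it is pointed at the local apiBase in Step 3.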

Prerequisites

To run modern distilled reasoning models like DeepSeek R1 effectively, your hardware needs to meet specific thresholds.

  1. OS: macOS (Apple Silicon recommended), Linux, or Windows (with WSL2).
  2. RAM/VRAM:
    • Minimum: 8GB RAM (for 1.5b or 7b parameter models).
    • Recommended: 16GB+ RAM (or 8GB+ dedicated VRAM) for 8b/14b models.
  3. Software:
    • Visual Studio Code (Latest).
    • Ollama (Latest release).

Step 1: Deploying the Inference Engine (Ollama)

Ollama abstracts the complexity of llama.cpp and GPU driver configuration. It detects your hardware acceleration (Metal, CUDA, or ROCm) automatically.

Installation

Download Ollama from the official site or install via terminal:

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from the official Ollama website.

Pulling DeepSeek R1

DeepSeek R1 is a reasoning model. For local use, we typically run the distilled versions (based on Llama or Qwen architectures), which offer strong performance for their size. We will use the 8-billion-parameter distill, which strikes a good balance between capability and memory footprint for most developer laptops.

Run this command in your terminal:

ollama run deepseek-r1:8b

Once the download finishes, the model will load. You can test it immediately in the terminal to ensure inference is working:

>>> Write a binary search algorithm in TypeScript.

If the text streams back, your backend is operational. Type /bye to exit the chat, but keep the Ollama background service running.

Step 2: The VS Code Integration Layer

To connect VS Code to Ollama, we need an extension that supports custom API endpoints. The industry standard for open-source AI integration is Continue.

  1. Open VS Code.
  2. Navigate to the Extensions Marketplace (Ctrl+Shift+X).
  3. Search for and install Continue (publisher: Continue).
    • Note: Avoid "CodeGPT" for this specific setup; Continue offers granular control over context providers and configuration JSON.

Step 3: Configuring the Connection

This is the step where most implementations fail. We must explicitly tell VS Code to route prompts to your local Ollama instance rather than OpenAI or Anthropic.

  1. Click the Continue icon in the VS Code sidebar.
  2. Click the Gear Icon (Settings) to open config.json.

Replace the models and tabAutocompleteModel sections with the following configuration. This setup optimizes for the DeepSeek R1 architecture.

{
  "models": [
    {
      "title": "DeepSeek R1 (8b)",
      "provider": "ollama",
      "model": "deepseek-r1:8b",
      "apiBase": "http://localhost:11434",
      "systemMessage": "You are an expert Principal Software Engineer. You write concise, performant, and modern code. You prefer TypeScript, functional patterns, and strict typing."
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek R1 Autocomplete",
    "provider": "ollama",
    "model": "deepseek-r1:8b", 
    "apiBase": "http://localhost:11434"
  },
  "allowAnonymousTelemetry": false
}

Configuration Deep Dive

  • provider: Tells the extension to format requests for the Ollama API structure.
  • model: Must match the tag you pulled in Step 1 exactly.
  • apiBase: The local loopback address. If you are running Ollama in WSL2 but VS Code in Windows, this might remain localhost or require the WSL IP address depending on your networking mode.
  • tabAutocompleteModel: We are reusing R1 for autocomplete here. Note: For faster autocomplete on older hardware, consider pulling qwen2.5-coder:1.5b and using that model specifically for this block.

Step 4: Verification and Usage

Save the config.json file. The extension usually reloads the config automatically.

Open a chat window in Continue (Cmd+L / Ctrl+L) and select DeepSeek R1 (8b) from the dropdown.
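If the model does not appear in the dropdown, confirm that Ollama is actually serving it before debugging the extension. A quick sketch against Ollama's /api/tags endpoint, which lists locally available models (file name illustrative):

// list-models.ts -- sanity check: ask Ollama which models are available locally.
const res = await fetch("http://localhost:11434/api/tags");
const { models } = await res.json();
console.log(models.map((m: { name: string }) => m.name));
// The output should include "deepseek-r1:8b"; if not, re-run the pull from Step 1.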

Test Prompt

Let's validate the model's knowledge of modern React patterns (Server Components).

Prompt: "Create a Next.js 14 Server Component that fetches data from an API and renders a grid of cards using Tailwind CSS."

Expected Output (DeepSeek R1):

// app/products/page.tsx
import { Product } from "@/types";

async function getProducts(): Promise<Product[]> {
  const res = await fetch('https://api.example.com/products', {
    cache: 'no-store' // dynamic data fetching
  });
  
  if (!res.ok) throw new Error('Failed to fetch products');
  
  return res.json();
}

export default async function ProductsPage() {
  const products = await getProducts();

  return (
    <div className="container mx-auto p-4">
      <h1 className="text-2xl font-bold mb-6">Latest Products</h1>
      <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
        {products.map((product) => (
          <div 
            key={product.id} 
            className="border rounded-lg p-4 shadow-sm hover:shadow-md transition-shadow"
          >
            <h2 className="text-xl font-semibold">{product.name}</h2>
            <p className="text-gray-600 mt-2">{product.description}</p>
            <div className="mt-4 font-mono text-blue-600">
              ${product.price.toFixed(2)}
            </div>
          </div>
        ))}
      </div>
    </div>
  );
}

Notice that R1 correctly utilizes async/await directly in the component (valid in RSCs) and uses standard Tailwind utility classes.
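One caveat: the generated component imports Product from "@/types", which the model did not define. If you want the file to compile as-is, a minimal type covering the fields used above might look like this (an illustrative assumption, not part of the model's output):

// types/index.ts -- hypothetical Product type matching the fields the component renders.
export interface Product {
  id: string;
  name: string;
  description: string;
  price: number;
}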

Optimization: Avoiding the "Cold Boot" Lag

One common frustration with local LLMs is the delay on the first request. This happens because the model must be loaded from disk into VRAM.

To fix this, you can force the model to stay resident in memory.

  1. Open your terminal.
  2. Run the following curl command to load the model with a defined keep-alive window (e.g., 60 minutes):
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "keep_alive": "60m"
}'

This ensures that while you are coding, the model remains hot, providing near-instant responses.
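Per Ollama's API documentation, keep_alive also accepts a negative value (for example, -1) to keep the model loaded indefinitely, and 0 to unload it immediately, which is handy when switching between large models.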

Common Pitfalls and Edge Cases

1. The Context Window Limit

DeepSeek R1 generally supports large context windows, but Ollama defaults to a conservative limit (often 2048 or 4096 tokens) to save memory. If the model starts "forgetting" code you just pasted, you need to increase the context window.

In your config.json within the options object:

"options": {
  "num_ctx": 8192
}

Warning: Increasing context size significantly increases RAM usage.
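If you want to confirm that a larger window actually takes effect, you can pass the same option directly to Ollama's /api/generate endpoint outside the IDE. A minimal sketch (num_ctx is a documented Ollama request option; the prompt here is a placeholder):

// context-check.ts -- sketch: issue a one-off request with an 8k context window.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1:8b",
    prompt: "Summarize this file: ...", // placeholder prompt
    stream: false,
    options: { num_ctx: 8192 }, // per-request context window
  }),
});
console.log((await res.json()).response);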

2. High CPU Usage

If you notice your fan spinning loudly and responses are slow (1-2 tokens per second), Ollama is likely falling back to CPU inference. This usually means the model is too large for your GPU VRAM.
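Before downgrading, you can check where the weights actually landed. Ollama's /api/ps endpoint reports how much of each loaded model is resident in VRAM; a sketch (field names follow Ollama's API documentation and may vary by version):

// gpu-check.ts -- sketch: estimate how much of the loaded model sits in VRAM.
const res = await fetch("http://localhost:11434/api/ps");
const { models } = await res.json();
for (const m of models) {
  const pct = m.size ? Math.round((100 * m.size_vram) / m.size) : 0;
  console.log(`${m.name}: ~${pct}% of weights in VRAM`);
}
// Values well below 100% mean part of the model is spilling to system RAM and the CPU.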

The Fix: Downgrade to a smaller quantization or parameter count.

ollama run deepseek-r1:7b-q4_K_M

The q4_K_M tag indicates 4-bit quantization, which reduces memory usage by nearly 50% with minimal loss in reasoning capability.
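As rough arithmetic: a 7B-parameter model needs about 7 GB for its weights at 8-bit precision, while q4_K_M stores roughly 4.5-5 bits per weight, bringing that down to around 4-4.5 GB before KV-cache overhead, which is what lets the quantized variant fit in modest VRAM budgets.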

Conclusion

By coupling DeepSeek R1 with Ollama and VS Code, you have built a development environment that is private by default, cost-effective, and highly capable.

You are no longer dependent on an internet connection to generate code, and your intellectual property never leaves your local machine. As open-weight models continue to close the gap with proprietary giants, the local-first AI stack is becoming the preferred choice for serious engineering teams.