How to Run Local LLMs on Intel Core Ultra NPUs using OpenVINO

Developers testing local language models on new Intel Core Ultra processors frequently hit a frustrating hardware bottleneck. Despite a dedicated Neural Processing Unit (NPU) designed specifically for AI workloads, standard HuggingFace or PyTorch pipelines default entirely to the CPU. The result is an overloaded processor, high power consumption, and inference speeds crawling along at single-digit tokens per second, while the NPU sits at 0% utilization in the task manager.

To achieve performant local LLM inference on Intel silicon, developers must bypass generic PyTorch execution graphs and utilize Intel's OpenVINO toolkit. This guide provides a complete, production-ready implementation for routing LLM workloads directly to the NPU.

Understanding the Execution Graph Bottleneck

Standard machine learning frameworks like PyTorch execute operations through backends such as ATen (for CPUs) or CUDA (for NVIDIA GPUs). None of these backends natively understands the architecture of an Intel NPU.

The NPU is a specialized accelerator composed of neural compute engines, MAC (Multiply-Accumulate) arrays, and dedicated DSPs (Digital Signal Processors). To execute a model on this hardware, the dynamic neural network graph must be compiled into a static Intermediate Representation (IR), and then mapped to the NPU's specific instruction set.

Without an abstraction layer to handle this compilation, PyTorch ignores the NPU entirely. By introducing OpenVINO (Open Visual Inference and Neural network Optimization) via the optimum-intel library, we can translate standard Transformers models into OpenVINO IR, compress the weights to fit memory bandwidth constraints, and target the NPU directly.

The Solution: OpenVINO and INT4 Weight Compression

Running an LLM on an NPU requires two critical steps: exporting the model to OpenVINO IR format and applying INT4 weight compression. LLM inference is notoriously memory-bandwidth bound. Compressing the weights from FP16 to 4-bit integers shrinks the weight footprint by roughly 75% (closer to 90% versus FP32), allowing the NPU to stream tokens efficiently without saturating system memory bandwidth.
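The bandwidth argument is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses an approximate parameter count and ignores the overhead of quantization scales and zero points:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring scale/zero-point overhead."""
    return n_params * bits_per_weight / 8 / 1e9

n = 1.1e9  # rough parameter count of TinyLlama-1.1B
print(f"FP32: {weight_footprint_gb(n, 32):.2f} GB")  # 4.40 GB
print(f"FP16: {weight_footprint_gb(n, 16):.2f} GB")  # 2.20 GB
print(f"INT4: {weight_footprint_gb(n, 4):.2f} GB")   # 0.55 GB
```

At INT4, the entire weight set of a 1.1B model fits comfortably in a fraction of a gigabyte, which is what makes per-token weight streaming viable on the NPU.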

Prerequisites for AI PC Development

Ensure your environment is configured for modern AI PC development. You must have the latest Intel NPU driver installed at the OS level (via Windows Update or Intel's driver portal). Next, install the required Python packages:

pip install "optimum[openvino]" nncf transformers accelerate
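Before exporting anything, it is worth confirming that OpenVINO can actually see the NPU; if the driver is missing, the device simply will not appear. The `pick_device` helper below is an illustrative sketch, not part of any library:

```python
def pick_device(available: list, preferred: str = "NPU") -> str:
    """Return the preferred device if the driver exposes it, else fall back to CPU."""
    return preferred if preferred in available else "CPU"

# With OpenVINO installed, the real device list comes from the runtime:
#   import openvino as ov
#   available = ov.Core().available_devices  # e.g. ['CPU', 'GPU', 'NPU']
# If 'NPU' is absent from that list, fix the driver before debugging Python code.

print(pick_device(["CPU", "GPU", "NPU"]))  # NPU
print(pick_device(["CPU", "GPU"]))         # CPU
```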

Step 1: Exporting and Quantizing the Model

This script downloads a standard HuggingFace model, applies INT4 quantization via NNCF (Neural Network Compression Framework), and exports it to the OpenVINO IR format (.xml and .bin).

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

def export_and_quantize_model():
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    export_dir = "tinyllama-openvino-int4"
    
    # Configure 4-bit weight compression
    # group_size=128 isolates quantization error to small blocks of weights;
    # ratio=0.8 compresses 80% of the weights to INT4 and keeps the rest in INT8
    quant_config = OVWeightQuantizationConfig(
        bits=4,
        sym=False,  # asymmetric quantization: per-group scale plus zero point
        group_size=128,
        ratio=0.8
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    # The export=True flag triggers the translation from PyTorch to OpenVINO IR
    print(f"Exporting {model_id} to OpenVINO IR. This requires high RAM and takes a few minutes...")
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        quantization_config=quant_config,
        compile=False # We delay compilation until inference time
    )
    
    # Save the compressed model locally
    model.save_pretrained(export_dir)
    tokenizer.save_pretrained(export_dir)
    print(f"Model successfully exported to ./{export_dir}")

if __name__ == "__main__":
    export_and_quantize_model()

Step 2: Intel NPU Python Inference Implementation

Once the model is exported, we load the IR files and explicitly set the target device to "NPU". We also configure model caching. The NPU requires JIT (Just-In-Time) compilation of the graph on the first run. Caching saves the compiled hardware instructions to disk, drastically reducing subsequent loading times.

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM

def run_npu_inference():
    model_dir = "tinyllama-openvino-int4"
    
    # OpenVINO specific configurations
    ov_config = {
        "PERFORMANCE_HINT": "LATENCY",
        "CACHE_DIR": "./npu_cache" # Prevents long compilation times on subsequent runs
    }
    
    print("Loading compiled model into Intel NPU...")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    # Target the NPU and pass the OpenVINO configuration
    model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device="NPU",
        ov_config=ov_config,
        compile=True
    )
    
    # Utilize standard HuggingFace pipeline syntax
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True
    )
    
    prompt = "<|system|>\nYou are a helpful AI assistant.\n<|user|>\nExplain how a Neural Processing Unit works.\n<|assistant|>\n"
    
    print("Generating response...")
    result = pipe(prompt)
    
    print("\n--- Response ---")
    print(result[0]['generated_text'].split("<|assistant|>\n")[-1])  # keep only the newly generated text

if __name__ == "__main__":
    run_npu_inference()

Deep Dive: Core Ultra NPU Programming Architecture

When the Python script executes device="NPU", a sequence of hardware-specific operations occurs beneath the HuggingFace API.

The OVModelForCausalLM class replaces the standard PyTorch nn.Module forward pass with the OpenVINO Runtime. The runtime reads the .xml topology file and the .bin weight file, and because we specified the NPU plugin, it communicates with the Intel NPU driver through the Level Zero API.

During compilation, OpenVINO analyzes the Transformer architecture (specifically the attention blocks and feed-forward networks) and maps their dense matrix multiplications onto the NPU's MAC arrays. Because we applied INT4 compression via OVWeightQuantizationConfig, the weights travel across the on-die memory fabric at a fraction of the bandwidth an FP16 or FP32 model would require. They are up-cast to FP16 in the NPU's local registers just before each multiply-accumulate, preserving acceptable accuracy while maximizing throughput.
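The group-wise compression scheme can be illustrated with a pure-Python sketch of asymmetric 4-bit quantization. This is a simplified model of the idea, not NNCF's actual implementation:

```python
def quantize_group(weights):
    """Asymmetric 4-bit quantization of one weight group.

    Stores integer codes in 0..15 plus a per-group scale and minimum.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 4 bits give 16 representable levels
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate FP values (the up-cast done before each MAC)."""
    return [c * scale + lo for c in codes]

group = [0.12, -0.37, 0.05, 0.91, -0.88, 0.33, 0.0, -0.11]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(f"max reconstruction error: {worst:.4f} (half a step is {scale / 2:.4f})")
```

Because each group of 128 weights gets its own scale, a single outlier weight only degrades precision within its own block, which is the rationale behind the group_size setting in Step 1.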

Handling Common NPU Pitfalls and Edge Cases

1. Unsupported Operations and Device Fallback

While the NPU is highly optimized for standard Transformer operations, certain experimental model architectures may contain custom layers not yet supported by the NPU compiler. If you encounter an "Unsupported Operation" error, change your device target to "AUTO:NPU,CPU".

model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="AUTO:NPU,CPU", # Try the NPU first; fall back to the CPU on failure
    ov_config=ov_config
)

With AUTO, OpenVINO attempts to compile the entire model on the NPU and transparently falls back to the CPU if that compilation fails, preventing pipeline crashes. Note that AUTO selects a single device for the whole model; if you instead need individual unsupported layers offloaded to the CPU while the rest run on the NPU, target the HETERO:NPU,CPU device.

2. The First-Token Latency Spike

If you omit the "CACHE_DIR" property in the ov_config, you will experience severe first-token latency (often taking several minutes) every time the Python script is launched. This is because the graph compiler must re-translate the IR to native hardware assembly. Always define a local cache directory for production applications.

3. Dynamic Shapes Limitation

Older versions of OpenVINO required static input shapes for NPU execution, meaning prompts had to be heavily padded to a fixed length. Modern optimum-intel (v1.14+) combined with OpenVINO 2024.1+ handles dynamic shapes automatically. If you encounter shape-related runtime errors, ensure your pip packages are strictly up to date, as the NPU compiler stack receives aggressive monthly optimization patches.
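A quick way to check the installed stack against those minimums is to read the package metadata. The version-parsing helper below is a simplified sketch that handles plain dotted version strings, not every PEP 440 form:

```python
from importlib.metadata import version, PackageNotFoundError

def at_least(ver: str, minimum: tuple) -> bool:
    """Compare a plain dotted version string against a (major, minor) minimum."""
    parts = []
    for piece in ver.split("."):
        if not piece.isdigit():
            break  # stop at suffixes like 'dev0' or 'rc1'
        parts.append(int(piece))
    return tuple(parts[:len(minimum)]) >= minimum

for pkg, minimum in [("optimum-intel", (1, 14)), ("openvino", (2024, 1))]:
    try:
        ok = at_least(version(pkg), minimum)
        print(pkg, "OK" if ok else f"too old, run: pip install -U {pkg}")
    except PackageNotFoundError:
        print(pkg, "not installed")
```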

By integrating this Intel OpenVINO tutorial into your workflows, you move from generic, CPU-bound PyTorch execution to hardware-accelerated pipelines. This export, compress, and cache pattern is widely used to optimize power consumption and throughput in AI applications running locally on edge devices.