Developers testing local language models on the new Intel Core Ultra processors frequently hit a frustrating hardware bottleneck. Despite a dedicated Neural Processing Unit (NPU) built specifically for AI workloads, standard HuggingFace and PyTorch pipelines default entirely to the CPU. The result is an overloaded processor, high power consumption, and inference crawling along at single-digit tokens per second, while the NPU sits at 0% utilization in Task Manager.

To achieve performant local LLM inference on Intel silicon, developers must bypass the generic PyTorch execution graph and use Intel's OpenVINO toolkit. This guide provides a complete, production-ready implementation for routing LLM workloads directly to the NPU.

## Understanding the Execution Graph Bottleneck

Standard machine learning frameworks such as PyTorch execute operations through backends like ATen (for CPUs) or CUDA (for NVIDIA GPUs). They do not natively understand the architecture of an Intel NPU.
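Before loading a model, it helps to ask OpenVINO which accelerators are actually present and route to the best one, falling back to the CPU when no NPU is available. The sketch below shows that selection logic; the `pick_inference_device` helper and its preference order are illustrative assumptions, not part of any official API, while the commented lines show how it would plug into `openvino.Core` and optimum-intel's `OVModelForCausalLM`:

```python
def pick_inference_device(available, preference=("NPU", "GPU", "CPU")):
    """Return the first preferred device family found in `available`.

    `available` is a list of OpenVINO device strings (e.g. ['CPU', 'GPU.0', 'NPU']).
    Device names may carry an index suffix like 'GPU.0', so we match by prefix.
    """
    for dev in preference:
        if any(d.startswith(dev) for d in available):
            return dev
    return "CPU"  # safe default: every OpenVINO install exposes a CPU plugin

# With OpenVINO and optimum-intel installed, the list comes from the runtime:
#   import openvino as ov
#   available = ov.Core().available_devices
#   from optimum.intel import OVModelForCausalLM
#   model = OVModelForCausalLM.from_pretrained(
#       model_id, export=True, device=pick_inference_device(available)
#   )

print(pick_inference_device(["CPU", "GPU.0", "NPU"]))  # → NPU
print(pick_inference_device(["CPU"]))                  # → CPU
```

Keeping the device choice explicit like this is what prevents the silent CPU fallback described above: if the NPU plugin is missing, you find out at load time rather than from a slow benchmark.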