You have prepared your dataset, configured your environment, and loaded the Llama 3 weights. You initiate the SFTTrainer, look away for a moment, and return to find the dreaded RuntimeError: CUDA out of memory. This is the most common bottleneck in LLM engineering today. Even developers with NVIDIA RTX 4090s (24GB VRAM) or A100s encounter this when attempting to fine-tune Llama 3 8B, let alone the 70B variant.

The issue is rarely the raw size of the model weights. The problem lies in the training overhead: gradients, optimizer states, and activation maps, which can balloon memory usage to 4x or 5x the model size. This guide provides a rigorous, architectural approach to solving OOM errors using PyTorch, QLoRA, and the latest Hugging Face ecosystem.

The Anatomy of an OOM Error

To fix the error, you must understand where the VRAM is going. When you load Llama 3 8B in standard FP16 (16-bit floating point), the math looks like this:

Model Weights: ~15GB (8 b...
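To make the arithmetic above concrete, here is a minimal back-of-envelope sketch of where full fine-tuning memory goes. The helper function name, the byte counts per component, and the 8.03B parameter figure for Llama 3 8B are illustrative assumptions; real usage also depends on activations, batch size, and sequence length, which this sketch deliberately omits.

```python
# Rough VRAM estimate for full fine-tuning in FP16 with the Adam optimizer.
# Assumptions (illustrative, not authoritative):
#   - weights and gradients stored in FP16 (2 bytes each per parameter)
#   - Adam keeps two FP32 moment estimates (8 bytes per parameter)
#   - activations are ignored; they add on top of this in practice

def training_vram_gb(n_params: float,
                     bytes_per_weight: int = 2,   # FP16 weights
                     bytes_per_grad: int = 2,     # FP16 gradients
                     optimizer_bytes: int = 8):   # Adam: 2 x FP32 states
    """Return (weights, gradients, optimizer_states, total) in GiB."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_weight / gib
    grads = n_params * bytes_per_grad / gib
    opt = n_params * optimizer_bytes / gib
    return weights, grads, opt, weights + grads + opt

# Llama 3 8B is roughly 8.03 billion parameters (assumed figure)
w, g, o, total = training_vram_gb(8.03e9)
print(f"weights ~= {w:.1f} GiB, grads ~= {g:.1f} GiB, "
      f"optimizer ~= {o:.1f} GiB, total ~= {total:.1f} GiB")
```

Running this yields roughly 15 GiB for the weights alone, consistent with the figure above, and a total near 90 GiB before activations: that is why a 24GB card fails immediately on full fine-tuning, and why the techniques that follow attack the gradient and optimizer terms rather than the weights.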