Posts

Showing posts with the label PyTorch

How to Fix "CUDA Out of Memory" When Fine-Tuning Llama 3 with LoRA

  You have successfully downloaded the Llama 3 8B weights. You set up your PyTorch environment, configured a basic LoRA adapter, and launched the training script on your RTX 3090 or Google Colab T4. Then, before the first epoch even starts, you hit the wall: `RuntimeError: CUDA out of memory. Tried to allocate...` This is the single most common barrier for engineers moving from using LLMs to fine-tuning them. It is frustrating because the math *seems* like it should work: if Llama 3 8B is roughly 15GB in half-precision (FP16), why does it crash a 24GB or even a 40GB card? The answer lies in the hidden memory overhead of the training process itself. This guide provides a root cause analysis of VRAM consumption and a production-grade code solution using QLoRA (Quantized LoRA) to fit Llama 3 training pipelines onto consumer hardware. The Root Cause: Where Did the VRAM Go? To fix the OOM (Out of Memory) error, you must understand what consumes GPU memory. It is not just the...
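The "hidden overhead" the excerpt mentions can be sketched with a back-of-envelope budget: mixed-precision AdamW training needs not just the FP16 weights (~2 bytes per parameter) but also gradients, FP32 optimizer moments, and an FP32 master copy for every *trainable* parameter (roughly 16 bytes each). The per-parameter byte counts and the 40M LoRA parameter count below are illustrative assumptions, not exact figures for any specific setup:

```python
def vram_estimate_gb(num_params: float, trainable: float) -> dict:
    """Rough VRAM budget for mixed-precision AdamW training.

    Assumes ~2 bytes/param for the FP16 weight copy, plus ~16 bytes per
    *trainable* param for gradients + Adam m/v states + FP32 master weights.
    Ignores activations, CUDA context, and fragmentation (all of which add more).
    """
    GB = 1024 ** 3
    weights = 2 * num_params / GB      # FP16 model weights
    optim = 16 * trainable / GB        # grads + optimizer state (trainable only)
    return {"weights": weights, "optimizer": optim, "total": weights + optim}

# Full fine-tune of an 8B model: every parameter is trainable.
full = vram_estimate_gb(8e9, trainable=8e9)

# LoRA: only the small adapter is trainable (~40M params, an assumed figure).
lora = vram_estimate_gb(8e9, trainable=4e7)
```

The weight copy alone is ~15GB either way; what LoRA eliminates is the ~120GB of optimizer state a full fine-tune would demand, which is why the 8B model "fits" at inference but not at full training precision.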

Fixing Llama 3.1 Fine-Tuning Errors: The Padding Token & `eot_id` Trap

  You have curated a high-quality instruction dataset. You have set up your QLoRA config. You launch `SFTTrainer`, and within seconds, your training loop crashes with an `IndexError: index out of range`, or worse, your loss flatlines at `0.0` or `NaN`. This is the most common bottleneck engineers face when migrating from Llama 2 to Llama 3 or 3.1. The issue isn't your dataset quality; it is a fundamental misalignment between the Llama 3.1 tokenizer's special tokens, the default padding behavior in Hugging Face's `transformers` library, and how the model interprets "End of Turn" versus "End of Text." This guide details the root cause of these convergence failures and provides the production-grade code required to fix them. The Root Cause: Why Llama 3.1 Breaks Standard Pipelines The Llama 3 family introduced a massive vocabulary expansion (128k tokens) and a shift in special token usage. In older models (and Llama 2), the End of Sentence (EOS) ...
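The padding trap the excerpt describes can be demonstrated without loading a model. A common collator pattern masks every padding position's label to `-100` so it is ignored by the loss; if you set `pad_token = eos_token` (the old Llama 2 habit), the genuine end-of-turn token gets masked too, and the model never learns to stop. A minimal sketch, using an assumed Llama 3.1 `<|eot_id|>` id of 128009 and a purely illustrative pad id:

```python
def mask_pad_labels(input_ids, pad_id):
    """Mimic collator-style label masking: padding positions become -100,
    which cross-entropy loss ignores."""
    return [-100 if tok == pad_id else tok for tok in input_ids]

EOT = 128009   # <|eot_id|>, Llama 3.1's end-of-turn token (assumed id)
PAD = 128004   # a *distinct* reserved token used for padding (illustrative id)

tokens = [791, 6369, 374, EOT]   # a short turn ending in <|eot_id|>

# Trap: padding reuses the EOT id, so masking hides the real EOT label too.
labels_bad = mask_pad_labels(tokens + [EOT, EOT], pad_id=EOT)

# Fix: pad with a dedicated token; the genuine EOT label survives masking
# and the model is actually trained to emit it.
labels_ok = mask_pad_labels(tokens + [PAD, PAD], pad_id=PAD)
```

In `labels_bad`, the final `EOT` of the real sequence is indistinguishable from padding and gets masked to `-100`; in `labels_ok` it remains a training target, which is the behavior you want.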