You just pulled the Llama 3 8B weights. You have a respectable GPU rig—maybe an RTX 4090 or an A100—and a clean dataset. You fire up your training script, expecting a smooth QLoRA run. Instead, you're hit with a CUDA out of memory error before the first epoch completes, or worse, your training loss suddenly creates a NaN crater.
Even if training succeeds, inference might output repetitive gibberish due to invisible tokenizer conflicts.
Llama 3 is a significant architectural step up from Llama 2, but it introduces specific sensitivities regarding tokenization and numerical stability. This guide details exactly how to resolve the three most common blockers in modern fine-tuning pipelines using Unsloth, PyTorch, and QLoRA.
1. Solving "Phantom" CUDA OOM Errors
Many engineers encounter OOM (Out of Memory) errors even when the calculated model size suggests they have plenty of VRAM headroom. This is rarely a capacity issue; it is usually an allocation efficiency issue.
The Root Cause: Memory Fragmentation and Activation Overheads
Standard Hugging Face AutoModel loading, even with bitsandbytes, often fragments GPU memory during the initial sharding process. Furthermore, the standard PyTorch implementation of Flash Attention 2 or SDPA (Scaled Dot Product Attention) can spike memory usage during the backward pass if gradient checkpointing isn't strictly optimized.
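Independent of Unsloth, PyTorch's caching allocator itself can be tuned to resist fragmentation. A minimal sketch — the `expandable_segments` option comes from PyTorch's CUDA memory-management documentation and must be set before the first CUDA allocation:

```python
import os

# Must run before torch initializes CUDA (i.e., before the first GPU op).
# expandable_segments lets the allocator grow memory segments on demand
# instead of reserving fixed-size blocks, which reduces the
# fragmentation-driven "phantom" OOMs described above.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

An older alternative is `max_split_size_mb:128`, which caps how large a cached block the allocator is allowed to split.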
The overhead isn't the weights (which are small in 4-bit); it's the optimizer states and gradients stored in full precision (float32), coupled with the activations stored for backpropagation.
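A back-of-the-envelope tally makes this concrete. The figures below are rough assumptions, not measured values — in particular the ~42M trainable-parameter count is an estimate for r=16 adapters on all seven projection matrices:

```python
# Rough VRAM tally for a Llama 3 8B QLoRA run (all figures approximate).
params = 8e9                                # base model parameters
weights_4bit_gb = params * 0.5 / 1e9        # 4-bit = 0.5 bytes/param -> ~4 GB

lora_params = 42e6                          # assumed r=16 adapter count
lora_fp32_gb = lora_params * 4 / 1e9        # trainable weights kept in float32
adam_states_gb = lora_params * 8 / 1e9      # AdamW: two float32 states per param
grads_gb = lora_params * 4 / 1e9            # float32 gradients for the adapters

static_gb = weights_4bit_gb + lora_fp32_gb + adam_states_gb + grads_gb
print(f"static footprint ~ {static_gb:.1f} GB")  # the rest of your VRAM goes to activations
```

The static footprint lands under 5 GB; everything beyond that on a 24 GB card is activation memory, which is exactly what gradient checkpointing and batch-size tuning control.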
The Fix: Unsloth Optimized Loading & Gradient Accumulation
We use Unsloth's FastLanguageModel loader. It rewrites the RoPE (Rotary Positional Embeddings) and MLP kernels to reduce peak memory usage by up to 30% compared to standard HF implementations.
We must also balance batch_size against gradient_accumulation_steps.
Implementation:
import torch
from unsloth import FastLanguageModel
# Configuration
max_seq_length = 2048 # Llama 3 supports 8k, but 2k fits 4090s comfortably
dtype = None # Auto-detects (Float16 for Tesla T4, Bfloat16 for Ampere+)
load_in_4bit = True # Essential for QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3-8b-bnb-4bit",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# Automatic device placement across available GPUs
device_map = "auto",
)
# Apply LoRA adapters with memory-efficient settings
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0, # Set to 0 for Unsloth optimization
bias = "none",
use_gradient_checkpointing = "unsloth", # CRITICAL: Uses specialized kernels
random_state = 3407,
use_rslora = False,
loftq_config = None,
)
Why this works: Setting use_gradient_checkpointing="unsloth" swaps generic PyTorch checkpointing hooks for Unsloth's own implementation, which offloads saved activations to cut peak VRAM. If you still OOM, halve your per_device_train_batch_size and double your gradient_accumulation_steps in your training arguments.
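The halve-and-double advice works because the optimizer only steps after accumulation, so the effective batch size — and therefore the training dynamics — is unchanged while peak activation memory is halved. A tiny sketch:

```python
# Effective batch size = per-device batch * accumulation steps * num GPUs.
def effective_batch(per_device, accum_steps, n_gpus=1):
    return per_device * accum_steps * n_gpus

before = effective_batch(per_device=4, accum_steps=2)   # OOMs on a 24 GB card
after = effective_batch(per_device=2, accum_steps=4)    # halve one, double the other
assert before == after == 8                             # identical optimizer behavior
```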
2. The Double BOS Token Warning (Inference Degradation)
This failure mode is easy to miss during Llama 3 fine-tuning: depending on your transformers version, tokenization may emit a warning about special tokens, or nothing at all — often the only symptom is degraded output after training.
If ignored, your model learns that every sequence starts with <|begin_of_text|><|begin_of_text|> (two Start-Of-Sequence tokens). During inference, this shifts the distribution, causing the model to output repetition or stop prematurely.
The Root Cause: Template Conflicts
Llama 3's tokenizer behavior differs from Llama 2. Many chat templates (like ChatML or Alpaca) manually insert a BOS token. However, the Llama 3 tokenizer often defaults to add_bos_token=True.
When you tokenize your dataset using a chat template, you inadvertently add a BOS token via the string formatting, and the tokenizer prepends another one.
The Fix: Formatting Check and Tokenizer Configuration
You must align the tokenizer settings with your prompt formatter. The safest approach for Llama 3 is to trust the prompt template but disable the tokenizer's automatic BOS injection.
from unsloth.chat_templates import get_chat_template
# 1. Setup the tokenizer with Unsloth's optimized template handler
tokenizer = get_chat_template(
tokenizer,
chat_template = "llama-3", # Explicitly use Llama 3 template
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # Map sharegpt/other formats
)
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
return { "text" : texts }
# 2. VALIDATION: Check a single example before training
# If this prints <|begin_of_text|> twice in a row, you have a problem.
debug_text = tokenizer.apply_chat_template(
[{"role": "user", "content": "Hello"}],
tokenize=False,
add_generation_prompt=False
)
# Llama 3 ships without a pad token; set one so batched padding works.
# Reusing EOS is the common quick fix, but make sure your collator masks
# padding, or the model can learn to suppress EOS and never stop generating.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print(f"Formatted Text Check: {debug_text}")
# Expected output should start with ONE header, e.g., "<|begin_of_text|><|start_header_id|>..."
Why this works: apply_chat_template(tokenize=False) lets us inspect the raw string and confirm the special tokens are correct. Llama 3 uses <|begin_of_text|> rather than Llama 2's <s>; if you treat Llama 3 like Llama 2, you corrupt the prompt format the model was trained on.
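Beyond eyeballing the string, you can assert on the token ids directly. A hypothetical check — the helper name is mine; 128000 is <|begin_of_text|> in the Llama 3 tokenizer, while the other ids below are made up for illustration:

```python
# Sanity check: the BOS id should appear exactly once, at position 0.
BOS_ID = 128000  # <|begin_of_text|> in the Llama 3 tokenizer

def leading_bos_count(ids, bos_id=BOS_ID):
    """Count consecutive BOS tokens at the start of a tokenized example."""
    n = 0
    for t in ids:
        if t != bos_id:
            break
        n += 1
    return n

good = [128000, 9906, 1917]    # one BOS -> fine
bad = [128000, 128000, 9906]   # template AND tokenizer each added one
assert leading_bos_count(good) == 1
assert leading_bos_count(bad) == 2  # fix: add_bos_token=False or strip it from the template
```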
3. Sudden NaN Loss Spikes
You are three hours into training. The loss curve looks beautiful, decreasing steadily. Suddenly, it hits NaN (Not a Number) and never recovers.
The Root Cause: Bfloat16 vs Float16 Stability
Llama 3 was pre-trained using Bfloat16 (BF16). Standard Float16 (FP16) has a smaller dynamic range. During backpropagation, gradients in deep networks like Llama 3 can become very small (underflow) or very large (overflow).
When using QLoRA (4-bit quantization), the de-quantization step adds noise. If you combine this with a high learning rate and the limited range of FP16, gradients explode to Infinity, which PyTorch renders as NaN.
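You can see the range gap with nothing but the standard library: Python's struct module packs IEEE half-precision ("e" format), and truncating a float32 to its top 16 bits is a crude stand-in for bfloat16. A sketch:

```python
import struct

# Float16 tops out at 65504; a gradient of 70000 cannot be represented.
try:
    struct.pack("<e", 70000.0)
except OverflowError:
    print("float16: 70000.0 overflows")

# Bfloat16 keeps float32's 8 exponent bits. Truncating a float32 to its
# top 16 bits (a crude bfloat16 round-toward-zero) still yields a finite,
# nearby value -- precision drops, but the magnitude survives.
bits32 = struct.unpack("<I", struct.pack("<f", 70000.0))[0]
bf16_as_f32 = struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))[0]
print(f"bfloat16 ~ {bf16_as_f32}")  # finite and close to 70000
```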
The Fix: Enforce BF16 and Gradient Clipping
If you are on Ampere or newer hardware (RTX 30/40 series, A100, H100), use BF16. On older hardware (T4, V100), BF16 is not supported, so you must pair FP16 with aggressive gradient clipping.
Implementation:
from trl import SFTTrainer
from transformers import TrainingArguments
# Detect hardware capabilities
is_bfloat16_supported = torch.cuda.is_bf16_supported()
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False, # True speeds up training on short examples but adds OOM risk on smaller cards
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
max_steps = 60,
learning_rate = 2e-4,
# STABILITY SETTINGS
fp16 = not is_bfloat16_supported,
bf16 = is_bfloat16_supported,
logging_steps = 1, # Log every step so a NaN spike is spotted immediately
optim = "adamw_8bit", # Reduces memory, maintains stability
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
# Clip gradients to prevent explosion in 4-bit space
max_grad_norm = 1.0,
),
)
Why this works:
- bf16 = True: Bfloat16 has the same exponent width as Float32, making it virtually immune to the overflow issues that plague Float16.
- max_grad_norm = 1.0: This clips the gradient vector. If the gradients try to spike (the precursor to NaN), PyTorch scales them down to an L2 norm of 1.0 before updating the weights.
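Clip-by-global-norm is simple enough to sketch in a few lines — a toy version of what max_grad_norm=1.0 triggers inside the Trainer:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

spiking = [300.0, -400.0]          # L2 norm 500 -- about to blow up the update
clipped = clip_grad_norm(spiking)
print(clipped)                      # direction preserved, norm now 1.0
```

Note that the vector's direction is untouched; only its magnitude is capped, so the update still points the right way.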
Edge Case: The "Loss 0.0" Anomaly
If your loss instantly goes to 0.0, your labels are likely incorrect.
In QLoRA supervised fine-tuning, the model calculates loss based on the difference between the predicted next token and the actual next token. If you are training on the "User" prompt instead of the "Assistant" response, or if your masking logic is flawed, the model might find a trivial solution or fail to compute loss entirely.
Ensure DataCollatorForCompletionOnlyLM is used if you are on the raw Hugging Face SFTTrainer, or use Unsloth's train_on_responses_only helper, which masks the user turns automatically.
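The masking itself is easy to illustrate. A toy sketch with made-up token ids — Hugging Face loss functions ignore any position whose label is -100:

```python
# Completion-only masking: tokens belonging to the prompt get label -100
# (ignored by cross-entropy), so only the assistant response is trained on.
IGNORE = -100
RESPONSE_MARKER = 99  # stand-in for the assistant-header token id

def mask_prompt(ids, marker=RESPONSE_MARKER):
    labels = list(ids)
    cut = ids.index(marker) + 1     # everything up to and including the marker
    for i in range(cut):
        labels[i] = IGNORE
    return labels

ids = [1, 5, 7, 99, 42, 43, 2]      # prompt ... marker, response, eos
print(mask_prompt(ids))              # [-100, -100, -100, -100, 42, 43, 2]
```

If every label ends up as -100 (for example, the marker sits at the end of a truncated sequence), the loss is computed over nothing — which is exactly the "loss 0.0" anomaly above.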
Summary
Fine-tuning Llama 3 is a delicate balance of memory management and numerical precision. By using Unsloth's optimized kernels to solve OOMs, auditing your chat templates to prevent Double BOS tokens, and enforcing Bfloat16 to prevent NaN divergence, you can create robust, production-grade models.
These configurations turn a fragile training script into a reliable pipeline capable of producing high-performance adapters for your specific domain.