Fixing Llama 3.1 Fine-Tuning Errors: The Padding Token & `eot_id` Trap

You have curated a high-quality instruction dataset. You have set up your QLoRA config. You launch SFTTrainer, and within seconds, your training loop crashes with an IndexError: index out of range, or worse, your loss flatlines at 0.0 or NaN.

This is the most common bottleneck engineers face when migrating from Llama 2 to Llama 3 or 3.1. The issue isn't your dataset quality; it is a fundamental misalignment between the Llama 3.1 tokenizer’s special tokens, the default padding behavior in Hugging Face’s transformers library, and how the model interprets "End of Turn" versus "End of Text."

This guide details the root cause of these convergence failures and provides the production-grade code required to fix them.

The Root Cause: Why Llama 3.1 Breaks Standard Pipelines

The Llama 3 family introduced a massive vocabulary expansion (128k tokens) and a shift in special-token usage. In Llama 2 and older models, the End of Sequence (EOS) token acted as a catch-all for both stopping generation and padding batches.

Llama 3.1 separates these concerns, creating a "trap" for standard fine-tuning scripts.

1. The Missing Pad Token

By default, the Llama 3.1 tokenizer configuration often leaves pad_token_id as None. When you use a data collator (like DataCollatorForSeq2Seq), it attempts to pad your sequences. If no pad token is defined, it may default to an index that exceeds the embedding matrix size or conflicts with existing weights, causing IndexError.
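This failure mode can be reproduced without a GPU. The following toy sketch (pure Python; the tiny list-of-lists stands in for the model's real embedding matrix) shows why a pad ID outside the vocabulary raises IndexError:

```python
# Toy stand-in for the model's embedding matrix: one row per vocabulary ID.
VOCAB_SIZE = 10  # real Llama 3.1: 128,256

embedding_matrix = [[0.0] * 4 for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    # The real embedding lookup does the same per-ID row access on the GPU.
    return [embedding_matrix[i] for i in token_ids]

embed([1, 2, 3])  # well-formed batch: fine

try:
    # A collator with no defined pad token can emit an ID >= vocab_size
    embed([1, 2, VOCAB_SIZE])
except IndexError as exc:
    print(f"IndexError: {exc}")
```

The real error message differs by backend, but the mechanism is identical: a row lookup past the end of the embedding table.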

2. The eot_id vs. end_of_text Conflict

This is where semantic drift occurs.

  • <|end_of_text|> (ID 128001): The global end of the sequence. The model stops everything.
  • <|eot_id|> (ID 128009): The end of a turn (user or assistant). Used in Instruct/Chat models to hand control back to the user.

If you blindly set tokenizer.pad_token = tokenizer.eos_token (a common Llama 2 hack), the result depends on which checkpoint you loaded: the base Llama 3.1 models use <|end_of_text|> as their EOS token, while the Instruct checkpoints use <|eot_id|>. Meanwhile, your instruct dataset uses chat templates whose turns end with <|eot_id|>.

Padding with <|end_of_text|> is survivable only if the attention masks are calculated perfectly; otherwise the model penalizes the padding tokens and confuses the "stop generation" signal with the "ignore this blank space" signal. Padding with <|eot_id|> is worse: a collator that masks pad tokens out of the loss will also mask every genuine turn-end token, so the model never learns when to hand control back. Both failure modes surface as non-converging loss.
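To make this concrete, here is a pure-Python sketch of what a typical collator does when building labels: it copies input_ids and masks every pad position to -100. The token IDs are the real ones from above; build_labels is a simplified stand-in for the library's collator logic:

```python
END_OF_TEXT = 128001  # <|end_of_text|>
EOT_ID = 128009       # <|eot_id|>
IGNORE_INDEX = -100   # PyTorch CrossEntropyLoss default ignore_index

def build_labels(input_ids, pad_token_id):
    """Mimic a collator: mask every pad position out of the loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]

# Same chat sequence, padded to length 5 with two different pad choices:
good = [101, 102, EOT_ID, END_OF_TEXT, END_OF_TEXT]  # pad = <|end_of_text|>
bad = [101, 102, EOT_ID, EOT_ID, EOT_ID]             # pad = <|eot_id|>

print(build_labels(good, END_OF_TEXT))
# -> [101, 102, 128009, -100, -100]  (the model still learns <|eot_id|>)
print(build_labels(bad, EOT_ID))
# -> [101, 102, -100, -100, -100]    (the real turn-end is masked away)
```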

3. Padding Direction

Inference requires Left Padding (to keep the generated token at the growing edge). However, SFTTrainer and Flash Attention 2 implementations in PyTorch often assume or require Right Padding during training to align memory blocks efficiently. Mixing these up results in garbage output or training crashes.
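A toy pad function (pure Python; PAD is an arbitrary stand-in ID) makes the difference visible. Right padding keeps every sequence aligned at the front, which is what the training kernels expect; left padding keeps the last real token at the right edge, where generation appends the next one:

```python
PAD = 128001  # stand-in pad ID

def pad_batch(batch, side="right"):
    """Pad every sequence in the batch to the longest length."""
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        fill = [PAD] * (width - len(seq))
        padded.append(seq + fill if side == "right" else fill + seq)
    return padded

batch = [[1, 2, 3], [4, 5]]
print(pad_batch(batch, side="right"))  # training:  [[1, 2, 3], [4, 5, PAD]]
print(pad_batch(batch, side="left"))   # inference: [[1, 2, 3], [PAD, 4, 5]]
```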

The Fix: Correct Tokenizer Initialization

To solve this, we must explicitly configure the tokenizer to distinguish between padding, termination, and turn-ending. We will use PyTorch and Hugging Face Transformers.

Prerequisites

Ensure you are running the latest stable versions to support Llama 3.1 architectures.

pip install -U "transformers>=4.43.0" "torch>=2.4.0" "peft>=0.12.0" "bitsandbytes"

The Implementation

Here is the robust setup for initializing Llama 3.1 for fine-tuning. This code handles the padding assignment and fixes the pad_token_id logic.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configuration
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def get_tokenizer_and_model():
    # 1. Load Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    # 2. FIX: Explicitly set the Pad Token
    # Llama 3.1 ships without a pad token by default.
    # We reuse <|end_of_text|> (128001) to avoid resizing the vocabulary.
    # (Llama 3.1 also ships a dedicated <|finetune_right_pad_id|> token at
    #  128004, which works equally well and never overlaps with EOS behavior.)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = "<|end_of_text|>"
        tokenizer.pad_token_id = 128001  # ID for <|end_of_text|>
    
    # 3. FIX: Set Padding Side
    # FP16/BF16 training with Flash Attention requires Right Padding.
    # Left Padding is strictly for inference.
    tokenizer.padding_side = 'right'

    # 4. Load Model with QLoRA (4-bit)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        attn_implementation="flash_attention_2", # Recommended for speed; requires flash-attn installed
        torch_dtype=torch.bfloat16
    )

    # 5. FIX: Resize Embeddings? 
    # Since we reused an existing token for padding, we DO NOT resize.
    # Resizing embeddings invalidates the pre-trained weights at the tail
    # and causes instability if not handled carefully.
    
    # However, we must ensure the model config knows about the pad token
    model.config.pad_token_id = tokenizer.pad_token_id
    
    return tokenizer, model

# Verify the fix
if __name__ == "__main__":
    tokenizer, model = get_tokenizer_and_model()
    
    print(f"Pad Token: {tokenizer.pad_token}")
    print(f"Pad Token ID: {tokenizer.pad_token_id}")
    print(f"Padding Side: {tokenizer.padding_side}")
    
    # Test encoding
    test_text = "Hello, Llama 3.1!"
    encoded = tokenizer(test_text, padding='max_length', max_length=10, return_tensors='pt')
    
    print(f"Input IDs: {encoded['input_ids']}")
    # You should see 128001 appearing at the end (right side)

Deep Dive: Handling the Data Collator

The code above fixes the model and tokenizer, but the training loop can still fail if the DataCollator masks the labels incorrectly.

In Causal Language Modeling (CLM), we calculate loss on the tokens the model predicts, and we must ignore padding tokens in that calculation. If your loss is exactly 0.0, your data collator is almost certainly masking every label to -100; if the loss diverges instead, your labels likely mirror your input IDs without the padding indices set to -100.

Use DataCollatorForCompletionOnlyLM (from trl), or the standard DataCollatorForSeq2Seq, making sure the fixed tokenizer is passed to it.
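The "completion only" idea is easy to sketch in pure Python: everything up to and including the response marker gets masked, so loss is computed only on the assistant's reply. The marker IDs below are invented for illustration; the real DataCollatorForCompletionOnlyLM matches a tokenized response_template string instead:

```python
IGNORE_INDEX = -100

def completion_only_labels(input_ids, response_template):
    """Mask the prompt: compute loss only on tokens after the template."""
    labels = list(input_ids)
    n = len(response_template)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template:
            # Mask everything up to and including the template itself
            for j in range(i + n):
                labels[j] = IGNORE_INDEX
            break
    return labels

template = [42, 43]  # invented stand-in for the assistant-header token IDs
print(completion_only_labels([7, 8, 42, 43, 99, 100], template))
# -> [-100, -100, -100, -100, 99, 100]
```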

Here is the correct SFTTrainer setup to ensure the padding token fix propagates to the loss function:

from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments

def train_setup(model, tokenizer, dataset):
    
    sft_config = SFTConfig(
        output_dir="./llama3-1-finetune",
        dataset_text_field="text",
        max_seq_length=2048,
        packing=False, # Set True only if you have huge data and want efficiency
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=False,
        bf16=True, # Recommended for Llama 3.1
        logging_steps=10,
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer, # CRITICAL: Pass the fixed tokenizer here
        args=sft_config,
    )
    
    return trainer

Why Passing the Tokenizer Matters

When you pass the tokenizer instance to SFTTrainer, the library automatically detects pad_token_id. It then configures the data collator to replace the labels corresponding to padding tokens with -100.

In PyTorch CrossEntropyLoss, -100 is the default ignore_index. If this handshake fails (because you skipped the manual pad_token_id assignment in Step 1), the loss function tries to predict the padding tokens, and the loss diverges.
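A pure-Python sketch of that contract (not the real PyTorch kernel, but the same math) shows how ignore_index keeps padding out of the average:

```python
import math

IGNORE_INDEX = -100

def cross_entropy(logits_rows, labels):
    """Mean cross-entropy over positions, skipping ignore_index --
    the same contract as torch.nn.CrossEntropyLoss."""
    total, count = 0.0, 0
    for logits, label in zip(logits_rows, labels):
        if label == IGNORE_INDEX:
            continue  # padding contributes nothing to the loss
        log_z = math.log(sum(math.exp(x) for x in logits))
        total += log_z - logits[label]
        count += 1
    return total / count if count else 0.0

# Three positions over a 3-class toy vocabulary; last position is padding.
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0]]

masked = cross_entropy(logits, [0, 1, IGNORE_INDEX])  # padding ignored
unmasked = cross_entropy(logits, [0, 1, 0])           # padding scored as class 0
print(masked < unmasked)  # True: unmasked forces the model to "predict" padding
```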

Edge Cases: The "Index Out of Range" Persistence

If you still encounter IndexError: index out of range after applying the fixes above, check these two edge cases:

1. The Added Token Trap

Did you add new special tokens using tokenizer.add_special_tokens? If yes, you must resize the model embeddings. The Llama 3.1 embedding matrix is sized exactly to the default 128,256-token vocabulary.

# Only run this if you ADDED new tokens, not if you just set the pad_token
tokenizer.add_special_tokens({'additional_special_tokens': ['<|my_new_token|>']})
model.resize_token_embeddings(len(tokenizer))

Warning: Resizing embeddings breaks the contiguous memory layout required by some LoRA implementations. You may need to enable modules_to_save=["embed_tokens", "lm_head"] in your LoRA config, which significantly increases VRAM usage.

2. Dataset Tokenization Overflow

If your max_seq_length is shorter than your actual data, and truncation is disabled or mishandled, the tokenizer might generate IDs that fall outside the clamped range in custom data loaders. Always ensure:

truncation=True, max_length=2048

is explicitly set in your preprocessing steps.
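A cheap defensive check in your preprocessing step catches both problems before they reach the GPU. This is a hypothetical helper, not a library API; in practice vocab_size would come from model.config.vocab_size:

```python
def validate_token_ids(input_ids, vocab_size, max_length):
    """Fail fast on IDs the embedding matrix cannot index,
    and on sequences that were never truncated."""
    assert len(input_ids) <= max_length, (
        f"sequence length {len(input_ids)} exceeds max_length {max_length}; "
        "set truncation=True in the tokenizer call"
    )
    bad = [t for t in input_ids if not 0 <= t < vocab_size]
    assert not bad, f"token IDs outside the embedding matrix: {bad[:5]}"

validate_token_ids([1, 2, 3], vocab_size=128256, max_length=2048)  # OK
```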

Conclusion

Fine-tuning Llama 3.1 requires unlearning habits from Llama 2. The key takeaways for stability are:

  1. Explicitly set pad_token_id to 128001 (<|end_of_text|>), or to the dedicated 128004 (<|finetune_right_pad_id|>).
  2. Enforce padding_side='right' during training.
  3. Validate that your chat template uses <|eot_id|> for turns, but the tokenizer uses <|end_of_text|> for padding.
  4. Pass the tokenizer into the Trainer to auto-configure label masking (-100).

By aligning these token definitions, you ensure the model differentiates between "silence" (padding) and "stop" (EOS), allowing the loss to converge and the model to learn effectively.