
Parsing and Recovering Malformed XML in Python with lxml

In data engineering, few things are as frustrating as a pipeline failure caused by a single malformed character in a 5GB XML feed.

If you rely on Python's built-in xml.etree.ElementTree, you have likely encountered the dreaded ParseError: not well-formed (invalid token). Standard XML parsers are designed to fail fast: the W3C specification treats any violation of well-formedness as a fatal error.

However, the real world is messy. Legacy systems produce unescaped ampersands, web scrapers retrieve truncated responses, and third-party APIs often deliver "XML-ish" data that breaks strict validators. Halting execution is rarely an option.

This guide details how to implement robust, fault-tolerant XML parsing using Python and lxml.

The Root Cause: Why Standard Parsers Fail

To fix the problem, we must understand the mechanics of the failure. Python's standard library xml.etree.ElementTree is often backed by the Expat parser. Expat is a stream-oriented parser that enforces strict XML compliance.

When Expat encounters a syntax error, such as a missing closing tag, an unescaped special character (a bare & where &amp; is required), or overlapping elements, it raises an exception immediately. It does not attempt to guess the author's intent.

The "Well-Formed" Requirement

XML parsers distinguish between "valid" (adheres to a DTD/Schema) and "well-formed" (syntactically correct).

A parser generally stops at:

  1. Tag Mismatches: <open>...</close> (wrong closing tag name).
  2. Attribute Errors: <item id=5> (missing quotes).
  3. Illegal Characters: Control characters or unescaped entities inside text nodes.
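To see this fail-fast behavior concretely, here is a minimal sketch of the standard library rejecting an unquoted attribute value (the sample document is hypothetical):

```python
import xml.etree.ElementTree as ET

broken = '<item id=5><name>Widget</name></item>'  # missing quotes around 5

try:
    ET.fromstring(broken)
except ET.ParseError as e:
    # Expat stops at the first offence and reports its position
    print(f"ParseError: {e}")
```

No partial tree is returned; the entire document is rejected because of one character.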

For data engineering pipelines processing millions of records, we need a parser that behaves more like a browser—lenient and heuristic—rather than a strict validator.

The Solution: lxml and libxml2 Recovery

The lxml library is the Pythonic binding for the C libraries libxml2 and libxslt. Unlike the standard library, libxml2 includes a powerful recovery mode designed to handle "tag soup."

The key is the lxml.etree.XMLParser class, specifically the recover=True argument. When enabled, the parser attempts to fix structure errors on the fly rather than raising an exception.

Prerequisites

Ensure you have lxml installed in your environment:

pip install lxml

Implementation: The Robust Parser

Below is a production-ready implementation of a fault-tolerant XML loader. It handles parsing errors, enforces encoding, and cleans up namespaces.

from lxml import etree
from typing import Optional
import logging

# Configure logging to capture parsing events
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def parse_broken_xml(xml_content: bytes, base_url: Optional[str] = None) -> Optional[etree._Element]:
    """
    Parses malformed XML bytes using lxml's recovery mode.
    
    Args:
        xml_content: The raw XML bytes (better for encoding detection).
        base_url: Optional URL for resolving relative paths.
        
    Returns:
        lxml.etree._Element: The root of the parsed tree, or None if fatal.
    """
    # 1. Configure the parser for maximum leniency
    parser = etree.XMLParser(
        recover=True,           # Attempt to fix broken tags
        encoding='utf-8',       # Explicit encoding usually safer for web data
        remove_blank_text=True, # Clean up whitespace noise
        huge_tree=True          # Allow very deep trees/large text nodes
    )

    try:
        # 2. Attempt parsing
        # We use fromstring with bytes to let lxml handle encoding declaration
        root = etree.fromstring(xml_content, parser=parser)
        
        # 3. Check for recovery errors
        # libxml2 logs recovery attempts to the parser's error log
        if parser.error_log:
            error_count = len(parser.error_log)
            logger.warning(f"XML parsed with {error_count} recovery events.")
            for error in parser.error_log:
                logger.debug(f"Line {error.line}: {error.message}")
                
        return root

    except etree.XMLSyntaxError as e:
        # This catches errors that even recover=True cannot fix
        logger.error(f"Fatal XML parsing error: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error during parsing: {e}")
        return None

# --- usage_example.py ---

# Simulation of a broken XML feed
# Errors: unclosed <price>, unescaped '&', missing quotes in attribute
bad_xml_data = b"""
<catalog>
    <product id="101">
        <name>Widget & Sons</name>
        <price>45.00
    </product>
    <product id=102>
        <name>Gadget Pro</name>
        <price>99.99</price>
    </product>
</catalog>
"""

logger.info("Attempting to parse malformed XML...")
root_element = parse_broken_xml(bad_xml_data)

if root_element is not None:
    logger.info("Parsing successful. Extracting data:")
    for product in root_element.findall('product'):
        name = product.findtext('name')
        price = product.findtext('price')
        logger.info(f"Product: {name} | Price: {price}")
else:
    logger.error("Failed to recover XML.")

Deep Dive: How Recovery Works

When recover=True is active, libxml2 employs heuristics to construct a valid tree despite the input errors.

1. Tag Inference

If the parser encounters <product><name>Item...</product>, it notices that <name> was never closed before <product> was closed. It implicitly inserts </name> before closing the parent.
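A minimal sketch of this inference in action, assuming lxml is installed:

```python
from lxml import etree

parser = etree.XMLParser(recover=True)

# </name> is missing; libxml2 closes it implicitly before </product>
root = etree.fromstring(b'<product><name>Widget</product>', parser=parser)

# The repaired tree contains a properly closed <name> element
print(etree.tostring(root))
```

The mismatch is still recorded in parser.error_log, so nothing is silently lost.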

2. Attribute Fixes

In the example <product id=102>, the parser detects the missing quotes around the attribute value. It guesses where the value ends (usually at a space or >) and normalizes the attribute in the resulting DOM.

3. Entity Handling

Standard parsers choke on a bare & followed by text. The recovery parser treats unescaped ampersands as literal text rather than the start of an entity reference, provided they don't look like valid entities.

Handling Edge Cases: When recover=True Isn't Enough

While lxml is powerful, it is not magic. There are scenarios where data engineering pipelines require additional pre-processing.

1. The "Encoding Hell"

Sometimes the XML declaration claims encoding="utf-8", but the file actually contains latin-1 bytes. lxml trusts the declaration over the actual byte content.

If you encounter encoding errors even with recovery, strip the declaration and force the encoding:

def force_parse_encoding(raw_bytes: bytes, encoding='utf-8') -> etree._Element:
    # Remove XML declaration to prevent conflict with parser encoding
    clean_bytes = raw_bytes.split(b'?>', 1)[-1] if b'<?xml' in raw_bytes else raw_bytes
    
    parser = etree.XMLParser(recover=True, encoding=encoding)
    return etree.fromstring(clean_bytes, parser=parser)
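A usage sketch of the same steps, inlined and self-contained. The sample bytes are hypothetical: the declaration claims utf-8, but the 0xE9 byte is latin-1 for 'é':

```python
from lxml import etree

# Declares utf-8, but contains a latin-1 byte (0xE9 = 'é')
raw = b'<?xml version="1.0" encoding="utf-8"?><city>Montr\xe9al</city>'

clean = raw.split(b'?>', 1)[-1]  # drop the misleading declaration
parser = etree.XMLParser(recover=True, encoding='latin-1')
root = etree.fromstring(clean, parser=parser)

print(root.text)  # Montréal
```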

2. Illegal Control Characters

XML 1.0 disallows most ASCII control characters (\x00 through \x1F, except tab, newline, and carriage return). recover=True often fails on null bytes. You must scrub these with a regex before passing data to the parser.

import re

def scrub_control_chars(content: str) -> str:
    # Regex to identify illegal XML characters
    illegal_xml_chars_re = re.compile(
        u'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe\uffff]'
    )
    return illegal_xml_chars_re.sub('', content)
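A quick, self-contained usage sketch of the same pattern on a string containing a null byte and a unit-separator character (the sample input is hypothetical):

```python
import re

# Same character class as scrub_control_chars above
illegal_xml_chars_re = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f'
    '\ud800-\udfff\ufdd0-\ufddf\ufffe\uffff]'
)

dirty = '<note>hello\x00world\x1fagain</note>'
clean = illegal_xml_chars_re.sub('', dirty)
print(clean)  # <note>helloworldagain</note>
```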

3. HTML as XML

If the data is extremely unstructured (e.g., scraped web pages masquerading as RSS feeds), lxml.etree might discard too much data during recovery.

In these cases, switch to lxml.html. The HTML parser is designed for the chaos of the web and produces a tree compatible with the ElementTree API.

from lxml import html

def parse_soup(content: bytes):
    # lxml.html handles "tag soup" better than etree recovery
    tree = html.fromstring(content)
    # XPath works here just as it does on standard XML trees
    return tree.xpath('//product/name/text()')

Performance Considerations

Using recover=True imposes a slight performance penalty compared to strict parsing, as the underlying C library must perform additional checks and branches. However, compared to pure Python alternatives (like BeautifulSoup used without the lxml backend), lxml remains significantly faster.

For multi-gigabyte files, avoid fromstring. Instead, pass recover=True to etree.iterparse, though note that a streaming parser has more limited recovery capabilities than loading the full DOM, as it cannot look ahead to resolve ambiguity.
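A minimal streaming sketch under those constraints, assuming a feed that yields one record per element (the catalog data here is a hypothetical in-memory stand-in for a large file):

```python
from io import BytesIO
from lxml import etree

feed = BytesIO(b"""<catalog>
    <product id="1"><name>Widget</name></product>
    <product id="2"><name>Gadget</name></product>
</catalog>""")

names = []
# Stream <product> elements one at a time instead of building the full DOM
for event, elem in etree.iterparse(feed, events=('end',), tag='product',
                                   recover=True):
    names.append(elem.findtext('name'))
    elem.clear()  # free memory for elements we have already processed

print(names)  # ['Widget', 'Gadget']
```

Clearing each element after processing keeps memory usage roughly constant regardless of file size.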

Conclusion

Data integrity often conflicts with system stability. By switching from xml.etree to lxml with recovery mode enabled, you can prevent rigid syntax requirements from crashing your data pipelines.

Always log recovery events. While the parser can fix the structure, knowing that the data was malformed allows you to flag the source for long-term quality improvement.