Pandas to Polars: Optimizing 10M+ Row Datasets to Fix MemoryError

You are watching your ETL pipeline logs. The memory usage climbs steadily: 8GB, 12GB, 16GB. Then, the inevitable crash: a MemoryError, or the Linux OOM killer sends SIGKILL to your process.

If you are processing datasets exceeding 10 million rows with Pandas, this scenario is a daily reality. While Pandas is the industry standard for exploration, its architectural design struggles with scale. It requires datasets to fit entirely in RAM, often needing 5x to 10x the dataset size in available memory to perform complex operations.

This article provides a direct, technical migration path from Pandas to Polars. We will solve the MemoryError not by buying more RAM, but by leveraging lazy evaluation, streaming execution, and Apache Arrow memory layouts.

The Root Cause: Why Pandas Explodes Memory

To fix the crash, you must understand why Pandas manages memory inefficiently compared to modern alternatives.

1. Eager Execution

Pandas is eager. When you execute a command like df = pd.read_csv('large_file.csv'), it immediately loads the entire file into memory. When you chain operations, Pandas often creates intermediate copies of the DataFrame for every step.

If you filter, group, and sort a 2GB dataset, Pandas might allocate intermediate objects that spike memory usage to 10GB+, triggering the crash.
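This is easy to verify: even a simple filter in Pandas materializes a brand-new DataFrame rather than a view. A minimal sketch (the column name and sizes are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})

# The filter allocates and copies a completely new DataFrame;
# while it is being built, both objects live in memory at once.
filtered = df[df["x"] > 500_000]

print(filtered is df)  # False: a separate copy, not a view
```

Chain several such steps and each intermediate copy coexists with its predecessor until garbage collection catches up, which is exactly the spike that triggers the OOM killer.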

2. The GIL and Single-Core Processing

Pandas relies heavily on NumPy. While NumPy releases the Global Interpreter Lock (GIL) for some matrix operations, much of Pandas' logic runs on a single core. It cannot natively parallelize complex string manipulations or aggregations across all available CPU cores.

3. Object Overhead

Pandas stores strings as Python objects (unless you opt into the PyArrow-backed dtypes). This results in massive memory overhead, pointer indirection, and cache misses.

Polars solves this by:

  • Lazy Evaluation: It builds a query plan and optimizes it before touching the data.
  • Rust & Arrow: It uses Rust for performance and Apache Arrow for memory-efficient, columnar data storage.
  • Streaming: It can process datasets larger than RAM by handling data in chunks.

The Fix: From Pandas Eager Loading to Polars Streaming

Let's look at a typical scenario. We have a 10M+ row CSV file of transaction logs. We need to filter by date, calculate total revenue per region, and sort the results.

The Pandas Approach (Memory Hog)

This code is standard, but dangerous for large datasets.

import pandas as pd

# 1. Eagerly load the entire 5GB+ file into RAM
# CRASH RISK: High
df = pd.read_csv("transactions_10m.csv")

# 2. Convert date column (creates a copy or modifies in-place)
df['date'] = pd.to_datetime(df['date'])

# 3. Filter data
# Creates a new DataFrame copy in memory
recent_df = df[df['date'].dt.year >= 2023]

# 4. Group and Aggregate
result = recent_df.groupby('region')['amount'].sum().reset_index()

print(result)

The Polars Solution (Memory Optimized)

We will switch to Polars using the Lazy API. We do not use read_csv; we use scan_csv.

import polars as pl

# 1. Define the Query Plan (No data is read yet)
q = (
    pl.scan_csv("transactions_10m.csv")
    .with_columns(
        pl.col("date").str.to_date("%Y-%m-%d") # Parse dates efficiently
    )
    .filter(
        pl.col("date").dt.year() >= 2023
    )
    .group_by("region")
    .agg(
        pl.col("amount").sum().alias("total_revenue")
    )
    .sort("total_revenue", descending=True)
)

# 2. Execute with Streaming
# This processes data in chunks, keeping memory usage constant.
df_result = q.collect(streaming=True)

print(df_result)

Deep Dive: Why This Fix Works

The Polars solution introduces three critical changes that prevent the MemoryError.

1. scan_csv vs read_csv

When you call pl.scan_csv, Polars creates a LazyFrame. It reads just enough of the file to infer the schema, but none of the data. This allows Polars to see the entire query context before execution.

If your pipeline only selects 2 columns out of 50, Polars pushes that projection down to the CSV reader. It will never read the unused 48 columns into RAM. Pandas would read all 50 columns first, then drop them.

2. Predicate Pushdown

In the code above, we filter for year >= 2023. Because Polars knows this before reading the data, it applies "predicate pushdown."

While reading the CSV, Polars checks the date value. If the row doesn't match the filter, it is discarded immediately. It never enters the expensive aggregation stage, drastically reducing memory pressure.

3. Streaming Engine (streaming=True)

The .collect(streaming=True) method is the most important feature for large datasets. (Recent Polars releases spell this .collect(engine="streaming"); the older keyword still works but is deprecated.) It tells the engine to process the input in batches.

Instead of materializing the whole table, Polars reads a chunk, applies the filter, updates the running hash map for the group_by, and discards the chunk. You can process a 100GB dataset on a laptop with 8GB of RAM using this method.

Direct Mapping: Common Operations

To refactor your existing pipelines, map your Pandas logic to Polars expressions. Avoid iterating over rows or using Python lambda functions, as these kill parallelism.

Selecting and creating columns

Pandas:

df['total'] = df['quantity'] * df['price']
df = df[['id', 'total']]

Polars:

df = df.select([
    pl.col("id"),
    (pl.col("quantity") * pl.col("price")).alias("total")
])

Filtering with Logic

Pandas:

df = df[(df['status'] == 'active') & (df['score'] > 50)]

Polars:

df = df.filter(
    (pl.col("status") == "active") & (pl.col("score") > 50)
)

Handling Nulls

Pandas:

df['category'] = df['category'].fillna('unknown')

Polars:

df = df.with_columns(
    pl.col("category").fill_null("unknown")
)

Common Pitfalls and Edge Cases

Even with Polars, you can hit performance bottlenecks if you violate core principles.

1. Using Python User-Defined Functions (UDFs)

Avoid map_elements (formerly apply) whenever possible.

# BAD: Using Python lambda blocks parallel execution
df.select(pl.col("a").map_elements(lambda x: x * 2))

# GOOD: Using native Polars expressions (runs in Rust)
df.select(pl.col("a") * 2)

Using a Python lambda forces Polars to convert data back to Python objects, pass them to the Python interpreter, and wait for the GIL. This negates the speed advantage.

2. Strict Data Types

Pandas is permissive; it will silently convert integers to floats if you introduce a NaN. Polars is strict.

If you try to compare a string column to an integer, Polars raises a SchemaError or ComputeError. You must cast types explicitly using .cast(pl.Int64) or .cast(pl.String) (pl.Utf8 in older versions).

3. Eager loading inside a loop

Do not use collect() inside a loop. Build a list of LazyFrames, use pl.concat, and call collect() once at the very end. This allows the query optimizer to parallelize the entire operation.

Conclusion

Moving from Pandas to Polars for datasets over 10 million rows is not just a syntax change; it is a shift from in-memory processing to query-optimized, out-of-core execution.

By replacing read_csv with scan_csv and leveraging collect(streaming=True), you eliminate the root causes of MemoryError. You gain multithreaded performance and type safety, ensuring your data pipelines are robust enough for production environments.