Processing multi-megabyte CSV files or deeply nested JSON payloads directly within Salesforce has historically been a perilous task. Developers frequently encounter System.LimitException: Apex heap size too large or System.LimitException: Apex CPU time limit exceeded when attempting to parse and transform this data.
Traditional approaches relying on standard string manipulation or regex fail to scale gracefully. By leveraging DataWeave in Apex, data engineers and Salesforce developers can offload complex payload transformations to a purpose-built engine, drastically reducing CPU time and memory consumption.
The Core Problem: Heap Limits and Immutable Strings
To understand why parsing CSV natively in Apex is so difficult, we must look at how the JVM-backed Apex runtime manages memory.
Strings in Apex are immutable. When you attempt to parse a CSV using String.split('\n'), the runtime does not simply place pointers across the existing string. It creates an entirely new array of strings. If you pass a 4MB CSV into a split function, your heap footprint instantly inflates to over 8MB. In synchronous transactions constrained by a 6MB heap limit, this guarantees immediate failure.
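You can observe this heap inflation directly with Limits.getHeapSize(). The sketch below is illustrative; buildLargeCsv() is a hypothetical helper standing in for any multi-megabyte payload source.

```apex
// Illustrative sketch: measuring heap growth from a naive split.
String csv = buildLargeCsv(); // hypothetical helper returning a multi-MB CSV
System.debug('Heap before split: ' + Limits.getHeapSize());
// split() allocates a brand-new String per row; the original payload
// remains on the heap alongside the new array
List<String> rows = csv.split('\n');
System.debug('Heap after split: ' + Limits.getHeapSize()); // roughly doubled
```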
Furthermore, naive Apex implementations struggle with the CSV standard (RFC 4180). Attempting to split by comma using String.split(',') corrupts data when fields contain commas inside quotes (e.g., "Doe, Jane"). Fixing this requires complex regular expressions, which sharply drive up CPU time.
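A minimal example of the corruption: a row that is perfectly valid per RFC 4180 falls apart under a naive split because split() has no concept of quoted fields.

```apex
String row = '1,"Doe, Jane",Austin';
List<String> fields = row.split(',');
// fields is now ('1', '"Doe', ' Jane"', 'Austin') -- four fields
// instead of three, with the quoted name torn in half
System.debug(fields.size()); // 4
```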
Similarly, native Apex JSON serialization—specifically deserializing heavily nested JSON via JSON.deserializeUntyped—requires the runtime to build a massive Map<String, Object> graph in memory. For Salesforce large data volume integrations, building this intermediate state before casting to SObjects is highly inefficient.
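A quick sketch of why deserializeUntyped is costly: every nested object becomes a heap-allocated Map that must be cast at each level before a single field can be read. The payload below is purely illustrative.

```apex
String payload = '{"account": {"name": "Acme", "billing": {"city": "Austin"}}}';
// deserializeUntyped materializes the entire object graph in memory
// before anything can be read
Map<String, Object> root = (Map<String, Object>) JSON.deserializeUntyped(payload);
Map<String, Object> account = (Map<String, Object>) root.get('account');
Map<String, Object> billing = (Map<String, Object>) account.get('billing');
String city = (String) billing.get('city'); // three casts to reach one field
```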
Implementing DataWeave in Apex
DataWeave is the functional programming language pioneered by MuleSoft for data transformation. With the General Availability (GA) of DataWeave in Apex, developers can now deploy .dwl scripts as metadata and invoke them directly from Apex.
The following implementation demonstrates how to parse a complex, comma-escaped CSV and transform it into a nested JSON structure, bypassing native Apex string parsing entirely.
Step 1: The DataWeave Script (csvToNestedJson.dwl)
Create a new file named csvToNestedJson.dwl in your force-app/main/default/dw directory.
This script natively parses the CSV input, applies transformations, and outputs a structured JSON payload.
%dw 2.0
input records application/csv
output application/json
---
records map (record) -> {
    accountId: record.account_id,
    companyName: record.company_name,
    billingDetails: {
        street: record.street_address,
        city: record.city,
        state: record.state,
        postalCode: record.zip_code
    },
    // DataWeave handles type coercion cleanly
    isActive: record.status == "Active",
    annualRevenue: record.revenue as Number default 0
}
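For illustration, given an input such as the one below (headers assumed to match the script above), the script would emit output along these lines:

```
account_id,company_name,street_address,city,state,zip_code,status,revenue
001,Acme Corp,"123 Main St, Suite 4",Austin,TX,78701,Active,5000000
```

```json
[
  {
    "accountId": "001",
    "companyName": "Acme Corp",
    "billingDetails": {
      "street": "123 Main St, Suite 4",
      "city": "Austin",
      "state": "TX",
      "postalCode": "78701"
    },
    "isActive": true,
    "annualRevenue": 5000000
  }
]
```

Note that the quoted comma in the street address survives intact, and revenue is coerced to a number rather than left as a string.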
Step 2: The Apex Execution Class
DataWeave scripts are invoked in Apex via the DataWeave.Script class. The Apex class simply passes the raw string payload to the engine and requests the transformed output.
public with sharing class DataTransformationService {
    /**
     * Transforms a raw CSV string into a structured JSON string using DataWeave.
     *
     * @param csvPayload The raw CSV string containing complex data (escaped quotes, commas).
     * @return A deeply nested JSON representation of the parsed CSV.
     */
    public static String convertCsvToJson(String csvPayload) {
        if (String.isBlank(csvPayload)) {
            return '[]';
        }
        try {
            // Instantiate the DataWeave script using the exact metadata file name
            DataWeave.Script dwScript = DataWeave.Script.createScript('csvToNestedJson');
            // Map the Apex variable to the DataWeave 'input' directive
            Map<String, Object> scriptInputs = new Map<String, Object>{
                'records' => csvPayload
            };
            // Execute the script
            DataWeave.Result dwResult = dwScript.execute(scriptInputs);
            // Retrieve the output as a serialized JSON string
            return dwResult.getValueAsString();
        } catch (Exception ex) {
            System.debug(LoggingLevel.ERROR, 'DataWeave Transformation Failed: ' + ex.getMessage());
            throw new AuraHandledException('Data processing failed. Please contact administration.');
        }
    }
}
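Invoking the service is a one-liner; the CSV literal below is illustrative.

```apex
// Example invocation (e.g., from anonymous Apex)
String csv = 'account_id,company_name,street_address,city,state,zip_code,status,revenue\n'
    + '001,"Acme Corp","123 Main St, Suite 4",Austin,TX,78701,Active,5000000';
String json = DataTransformationService.convertCsvToJson(csv);
System.debug(json); // nested JSON produced by the DataWeave engine
```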
Deep Dive: Architectural Advantages
When you execute the code above, the Apex runtime does not iterate over the CSV. Instead, the csvPayload string pointer is passed to the underlying DataWeave execution environment managed by Salesforce.
- Memory Efficiency: The DataWeave engine operates using streaming architectures where possible. It parses the CSV and maps the JSON output without duplicating the payload into massive Apex Lists or Maps.
- Bypassing Apex JSON Serialization Overhead: Normally, converting CSV to JSON in Apex requires defining wrapper classes, parsing the CSV to populate a List<Wrapper>, and finally calling JSON.serialize(wrapperList). By defining output application/json in the DWL script, DataWeave handles the serialization natively in a lower-level, optimized environment.
- Native RFC 4180 Compliance: DataWeave intrinsically understands CSV mechanics. Escaped quotes, commas within quoted fields, and varying line endings (\n vs \r\n) are handled automatically.
Common Pitfalls and Edge Cases
Handling Malformed Headers
If your incoming CSV files have inconsistent or missing headers, mapping by key (e.g., record.company_name) will result in null values. You can protect against this by reading the CSV as an array of arrays and utilizing index-based mapping.
To do this, modify the reader properties in the DWL script:
input records application/csv header=false
Then reference columns by index: record[0], record[1].
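Putting both changes together, a defensive index-based variant of the script might look like this (column positions are assumed for illustration):

```
%dw 2.0
input records application/csv header=false
output application/json
---
records map (record) -> {
    // With header=false, each record is an array; reference columns by position
    accountId: record[0],
    companyName: record[1]
}
```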
DataWeave Execution Limits
DataWeave in Apex does not run outside the governor framework: script execution still counts against the transaction's CPU and heap limits, and Salesforce additionally caps the number of DataWeave script invocations per transaction. Extremely large payloads (e.g., a 10MB CSV) should therefore be handled asynchronously (via Batch Apex or Queueables), where the heap limit expands to 12MB and the CPU limit is more forgiving.
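A minimal sketch of the asynchronous pattern, reusing the service class from earlier; the class name and persistence step are illustrative.

```apex
// Defers a large transformation to an async context (12MB heap limit).
public with sharing class CsvTransformJob implements Queueable {
    private final String csvPayload;

    public CsvTransformJob(String csvPayload) {
        this.csvPayload = csvPayload;
    }

    public void execute(QueueableContext ctx) {
        String json = DataTransformationService.convertCsvToJson(csvPayload);
        // ...persist or forward the transformed payload here
    }
}
// Enqueue from the original transaction:
// System.enqueueJob(new CsvTransformJob(largeCsv));
```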
Mapping Directly to SObjects
If your goal is to bypass JSON entirely and insert records, DataWeave can output directly to application/apex.
%dw 2.0
input records application/csv
output application/apex
---
records map (record) -> {
    Name: record.company_name,
    Industry: record.industry
} as Object {class: "Account"}
When retrieving the result in Apex, you cast the output directly to a list of SObjects:
List<Account> accountsToInsert = (List<Account>) dwResult.getValue();
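End to end, the flow looks like the sketch below. The script name csvToAccounts is hypothetical and assumes the application/apex DWL above has been deployed under that name.

```apex
// Execute the application/apex script and insert the resulting records
DataWeave.Script script = DataWeave.Script.createScript('csvToAccounts');
DataWeave.Result result = script.execute(
    new Map<String, Object>{ 'records' => csvPayload }
);
List<Account> accountsToInsert = (List<Account>) result.getValue();
insert accountsToInsert; // no JSON round-trip, no wrapper classes
```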
This is the most direct and performant way to process Salesforce large data volume imports natively on the platform.
Conclusion
Replacing legacy string parsing methods with DataWeave in Apex modernizes data integration architectures on the Salesforce platform. By delegating complex CSV parsing and JSON generation to an engine designed explicitly for data transformation, developers ensure their applications remain scalable, strictly within governor limits, and resilient against edge-case data formats.