
4 minute read

How to use MuleSoft for efficient CSV data streaming

Saša Jovanović

Software Engineer

A common challenge in integrations is automating the transfer and transformation of large datasets between remote systems. One of our clients recently faced this exact problem: they needed to move CSV data from one SFTP server to another while maintaining data integrity and performance. In this post, I’ll walk you through our solution using a simplified showcase that omits production-grade configuration and client-specific logic.


Before we dive into the solution, let’s describe the challenge and key constraints.

The challenge

To handle this integration problem successfully, we need to address the following four major requirements:

  1. Retrieve a large CSV file from a source SFTP server;
  2. Transform the data to match a target format (e.g., column mapping, data type conversion);
  3. Preserve the original record order throughout processing;
  4. Write the transformed data to a new file on a destination SFTP server;

keeping in mind that:

  1. The CSV files can be very large, often containing hundreds of thousands of records;
  2. The entire file must not be loaded into memory at once; the solution has to be stream-based and memory-efficient;
  3. Parallel processing is not allowed, as maintaining record order is critical for downstream systems.

The solution

Local setup

To simulate the integration environment, I will run the source and target SFTP servers as Docker containers. This allows us to develop and test locally with ease.

sftp-origin: Source SFTP server container

docker run -p 2222:22 -d --name sftp-origin  -v C:/Users/PC/sftpdata/origin:/home/foo/upload atmoz/sftp foo:password:::upload


sftp-remote: Destination SFTP server container

docker run -p 2223:22 -d --name sftp-remote  -v C:/Users/PC/sftpdata/remote:/home/foo/upload atmoz/sftp foo:password:::upload

Each container maps a local folder to its remote /upload directory for straightforward file management.

For inspecting files and performing file transfers, you can use an SFTP client (I’m using FileZilla) with the following parameters:

  • Host: localhost
  • Port: 2222 or 2223 (depending on the target SFTP container)
  • Username: foo
  • Password: password
  • Remote directory: /upload
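
These same parameters feed the Mule SFTP connector configurations. Here is a minimal sketch of what they could look like: sftpOriginConfig is the name referenced by the listener later in this post, while sftpRemoteConfig is a name I’m assuming for the destination; in a real project the credentials would come from a secure properties file rather than being hardcoded.

<sftp:config name="sftpOriginConfig">
	<!-- Source SFTP server: the sftp-origin Docker container -->
	<sftp:connection host="localhost" port="2222" username="foo" password="password" workingDir="/upload" />
</sftp:config>

<sftp:config name="sftpRemoteConfig">
	<!-- Destination SFTP server: the sftp-remote Docker container -->
	<sftp:connection host="localhost" port="2223" username="foo" password="password" workingDir="/upload" />
</sftp:config>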

Sample data

To test our implementation, we’ll use a test.csv file that contains 916,800 records. Each record represents a 30-minute time window for a specific store on a particular date, along with the visitor count.

store,date,start,end,visitors
1021,03/01/2024,09:00,09:29,15
1021,03/01/2024,09:30,09:59,10
1021,03/01/2024,10:00,10:29,28
1021,03/01/2024,10:30,10:59,22
1021,03/01/2024,11:00,11:29,24
1021,03/01/2024,11:30,11:59,39
1021,03/01/2024,12:00,12:29,14
1021,03/01/2024,12:30,12:59,35
...

Strict record order is essential in this dataset because it tracks visitor activity for each store over time.
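
Once the SFTP listener (configured in the next section) parses the stream as CSV with header=true, each record is handled as an object keyed by the column names. Conceptually, the streamed payload looks like this; note that all values arrive as strings unless explicitly converted:

[
  { store: "1021", date: "03/01/2024", start: "09:00", end: "09:29", visitors: "15" },
  { store: "1021", date: "03/01/2024", start: "09:30", end: "09:59", visitors: "10" },
  ...
]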

MuleSoft implementation

Now, let’s take a closer look at the MuleSoft flow that powers this integration, from streaming and transforming data to writing the results to the destination SFTP server.

> Note: For clarity, I’ll omit error handling, but in production you should always include robust error handling, retries, and logging.

Here is our flow:

1. SFTP Listener

<sftp:listener doc:name="On New or Updated File" doc:id="1eeadde4-f731-476d-a7b3-c63852fbdd47" config-ref="sftpOriginConfig"
	outputMimeType='application/csv; bodystartlinenumber=1; headerlinenumber=0; streaming=true; separator=","; header=true'
	moveToDirectory="/upload/archive"
	renameTo='#["test_" ++ now() as String {format: "yyyy-MM-dd_HH-mm-ss"}]'>
	<non-repeatable-stream />
	<scheduling-strategy>
		<fixed-frequency frequency="5" timeUnit="MINUTES" />
	</scheduling-strategy>
	<sftp:matcher filenamePattern="*.csv"/>
</sftp:listener>

To detect new files, we use scheduled polling every 5 minutes and configure the listener to pick up CSV files only. The streaming=true parameter is crucial for large files: it ensures line-by-line stream processing and prevents loading the complete file into memory. By choosing the <non-repeatable-stream /> strategy, we read and consume each chunk of the payload only once, which improves performance because Mule does not need to keep the consumed data in memory for further processing.

For post-read actions we want to:

  • Rename the file using a custom format that includes a timestamp: test_yyyy-MM-dd_HH-mm-ss.csv
  • Archive it to /upload/archive directory

2. Chunked processing

After enabling streaming and choosing a streaming strategy, we want to process the payload in chunks. For this, we use the For Each scope and configure its batchSize, which partitions the input into sub-collections of the specified size.

<foreach doc:name="For Each" doc:id="cad2e08c-f764-4968-9824-53abdfb24ae3" batchSize="2000">
...
</foreach>

The `batchSize` provided is for example purposes only. The ideal value will vary based on the dataset’s structure, the application’s memory usage, and observed performance metrics.

So, if the input has 100,000 records:

  • Mule splits it into 50 chunks of 2000 records each (with the last chunk possibly smaller)
  • Each chunk becomes a payload in the scope
  • You can apply transformations to each chunk independently (a sketch follows below)

With this, we achieve:

  • High throughput
  • Low memory footprint
  • Decreased I/O overhead
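
The actual mapping was client-specific and is not part of this showcase, which only adds a space between columns (see the Output section). As a rough illustration of where a richer transformation would sit, a Transform Message step that maps the sample file’s columns and converts the visitor count to a number could look something like this; the column names come from test.csv above, and everything else is an assumption rather than the code from the repository:

<foreach doc:name="For Each" batchSize="2000">
	...
	<ee:transform doc:name="Transform Chunk">
		<ee:message>
			<ee:set-payload><![CDATA[%dw 2.0
output application/csv header=false, separator=","
---
// payload here is one chunk: an array of up to 2000 parsed records
payload map ((row) -> {
	store: row.store,
	date: row.date,
	start: row.start,
	end: row.end,
	visitors: row.visitors as Number
})]]></ee:set-payload>
		</ee:message>
	</ee:transform>
	...
</foreach>

Because every chunk is appended to the same output file (see the next steps), the CSV header is disabled in this sketch; if you need a header row, emit it only with the first chunk.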

3. Determining write mode

Since we want the output in a single file, we dynamically set whether to overwrite or append to the destination file using:

<foreach doc:name="For Each" doc:id="cad2e08c-f764-4968-9824-53abdfb24ae3" batchSize="2000">
...
<set-variable value='#[if(vars.counter == 1) "OVERWRITE" else "APPEND"]' variableName="sftpWriteMode" />
...
</foreach>

  • First chunk: OVERWRITE -> creates a new file
  • Subsequent chunks: APPEND -> extends the same existing file

This strategy avoids fragmentation and ensures data is written cumulatively in the correct order.

4. Writing to remote SFTP

Each batch of 2000 records is written in a single SFTP operation, which minimizes disk I/O and accelerates the overall process.
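
The write operation itself is not shown above, so here is a minimal sketch of how it could look inside the For Each scope. The connection name (sftpRemoteConfig) and the output file name are assumptions carried over from the earlier configuration sketch; the write mode comes from the variable set in the previous step, and the content written is the current chunk’s payload:

<foreach doc:name="For Each" batchSize="2000">
	...
	<!-- Writes the current chunk; OVERWRITE on the first chunk, APPEND afterwards -->
	<sftp:write doc:name="Write" config-ref="sftpRemoteConfig" path="output.csv" mode="#[vars.sftpWriteMode]" />
	...
</foreach>

With workingDir set to /upload in the connection, path="output.csv" resolves to /upload/output.csv on the destination server.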

Output

The final exported file on the destination SFTP server mirrors the source structure but reflects the applied transformations (in this showcase, simply adding a space between the columns). Data is written in the original order, in a memory-efficient manner, with minimal I/O overhead.

Conclusion

By leveraging streaming, chunked processing, and smart write-mode handling, we created a scalable MuleSoft integration that:

  • Handles large CSV files efficiently
  • Preserves record order throughout processing
  • Reduces memory usage and I/O bottlenecks
  • Is clean, maintainable, and production-ready

This solution is a strong example of how to design performant integrations in resource-constrained environments.

To see the complete source code and this solution in action, visit the GitHub repository. Feel free to explore, fork, or contribute to the project!


Saša Jovanović

Software Engineer

Saša Jovanović is a software developer and content creator who has been part of the ProductDock team since 2022. His primary expertise lies in MuleSoft development, where he focuses on building reliable and scalable API integrations. With a strong background in backend systems, he is passionate about clean code and efficient workflows, consistently exploring new technologies to enhance development practices.

