How I Lowered RAM Usage by 200x When Extracting and Filtering Zipped CSV Files

Recently, I created a dashboard to monitor the status of Roraima's revenue agreements, available here. This open government data is provided by the Governo Federal, specifically through the TransfereGov system.

The open data for the dashboard is served as zipped CSV files (.csv.zip). Unzipped, these files reach 1.7 GB at the time of writing and keep growing. With data manipulation tools such as pandas, the memory footprint was unsustainable on my server with only 4 GB of RAM: other services were already running on it, and the filtering script read the files entirely into memory.
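
For context, the memory-hungry baseline looked roughly like the sketch below. It is a minimal illustration rather than my actual script: it assumes the semicolon-delimited CSV and uses the UF_PROPONENTE column as an example filter.

import pandas as pd

# The entire 1.7 GB CSV is parsed into a DataFrame before any filtering happens
df = pd.read_csv("siconv_proposta.csv", sep=";", dtype=str)

# Only a small slice is actually needed, but the full DataFrame already sits in RAM
rr = df[df["UF_PROPONENTE"] == "RR"]
rr.to_csv("siconv_proposta_rr.csv", sep=";", index=False)

Illustrative pandas baseline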

The first solution was to migrate from pandas to polars, which uses LazyFrames to apply the filters I needed while reading the CSV files, retaining only the essential data in memory. This solution already reduced the script's RAM usage from 1.7 GB to 1 GB, a 41.18% reduction.
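
A minimal sketch of that polars approach, assuming the same semicolon delimiter and using the UF_PROPONENTE filter as a stand-in for the filters the real main.py applies:

import polars as pl

# scan_csv builds a LazyFrame: rows are not loaded until collect() is called,
# and the filter is pushed down so only matching rows are materialized
lf = pl.scan_csv("siconv_proposta.csv", separator=";")

rr = lf.filter(pl.col("UF_PROPONENTE") == "RR").collect()
rr.write_csv("siconv_proposta_rr.csv", separator=";")

Illustrative polars version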

The solution should have worked even better, but there was an obstacle: polars' collect_schema, which reads an initial portion of the CSV file to infer its schema, was computationally expensive. Because a large number of lines is needed to infer the schema correctly, data inconsistencies increased the cost of this operation and drove up peak RAM usage.
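
Roughly where that cost showed up, again as a hedged sketch; the infer_schema_length value here is illustrative, not what my script used:

import polars as pl

lf = pl.scan_csv(
    "siconv_proposta.csv",
    separator=";",
    # A large sample is needed to infer the schema correctly, and
    # inconsistent rows make that inference slower and more memory-hungry
    infer_schema_length=10_000,
)

# collect_schema() resolves the LazyFrame's schema, forcing the inference work up front
schema = lf.collect_schema()

Where the schema inference cost came from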

So, I had my initial solution, shown below. However, a better approach was needed.

#!/bin/sh

# Getting the CSV file
curl -s \
  -o siconv_proposta.csv.zip \
  'https://repositorio.dados.gov.br/seges/detru/siconv_proposta.csv.zip'
unzip siconv_proposta.csv.zip

# Running the Python script to filter the data
uv run main.py

Initial solution

It was then that I found two tools: funzip and csvkit. funzip, a CLI tool similar to unzip but designed to extract a single file from a zip archive as a stream, was a significant discovery. csvkit's csvgrep can filter the data streamed from funzip, holding only the necessary rows in memory before writing to disk. Together they were the perfect combination, and applying them led to the optimized solution below.

#!/bin/sh

# Getting the CSV file:
# curl streams the zipped data, funzip unzips the stream,
# and csvgrep keeps only the rows where UF_PROPONENTE is 'RR'
curl -s 'https://repositorio.dados.gov.br/seges/detru/siconv_proposta.csv.zip' \
  | funzip \
  | csvgrep -d ';' -c 'UF_PROPONENTE' -m 'RR' \
  > siconv_proposta.csv

# Running the Python script to handle complex data filters
uv run main.py

Optimized solution

Using these tools together with curl's native streaming, I reached a significant reduction in peak memory usage, from 1.7 GB to 8 MB, comfortably within my server's limits. On top of that, the Python script now focuses only on the complex filtering, which makes it leaner.

So, the results were:

  • Memory footprint was lowered by 99.53%, or 212x, from 1.7 GB to 8 MB
  • Less Python logic needed to filter the data
  • More server resources left to handle other services

If there are any optimizations I missed that could further improve my memory savings, please reach out.