Large files
Astro uses batched I/O for files whose size is at or above large_file_threshold_bytes (default 100 MB, i.e. 100 * 1024 * 1024 bytes). This applies to ingest, filter steps, and quarantine appends.
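The "at or above" comparison can be sketched with the standard library alone. The helper name `uses_batched_path` and the constant `DEFAULT_THRESHOLD_BYTES` are hypothetical and are not part of Astro's API; this only illustrates the size check described above, not how Astro implements it.

```python
import os
import tempfile

# Mirrors the documented default; the constant name is an assumption.
DEFAULT_THRESHOLD_BYTES = 100 * 1024 * 1024


def uses_batched_path(path: str, threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> bool:
    """Hypothetical helper: True when the file is at or above the threshold."""
    return os.path.getsize(path) >= threshold_bytes


# Demo with a 1 KiB file and tiny thresholds so it runs instantly.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024)
    demo_path = tmp.name

at_threshold = uses_batched_path(demo_path, threshold_bytes=1024)     # size == threshold
below_threshold = uses_batched_path(demo_path, threshold_bytes=1025)  # size < threshold
os.remove(demo_path)
```

A file exactly at the threshold takes the batched path, which matches the "at or above" wording.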
Pipeline tuning
Configure on your pipeline class:
class ExamplePipeline(Pipeline):
    large_file_threshold_bytes = 100 * 1024 * 1024  # 100MB
    ingest_batch_size = 100_000  # CSV rows per ingest batch
    run_batch_size = 100_000  # Parquet rows per filter/batch iteration
| Setting | Default | Purpose |
|---|---|---|
| large_file_threshold_bytes | 100MB | Trigger batched paths |
| ingest_batch_size | 100,000 | CSV rows validated and written per batch during ingest |
| run_batch_size | 100,000 | Parquet rows processed per batch during filter and batch iteration |
Ingest
Large-file ingest reads CSVs in a single pass, validates each batch with Pandera, and appends to a single Parquet file via PyArrow. Small files use the eager path.
UTF-8 files: scan_csv + collect_batches
Non-UTF-8 encodings (e.g. cp1252): PyArrow streaming CSV reader
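The single-pass, validate-then-append flow can be illustrated with a standard-library sketch. It is not Astro's implementation: `iter_csv_batches` and `validate_batch` are hypothetical names, the validation is a stand-in for the Pandera check, and a row counter stands in for the Parquet append.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def iter_csv_batches(handle, batch_size: int) -> Iterator[list]:
    """Yield lists of at most batch_size rows, reading the input once."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def validate_batch(batch: list) -> None:
    """Stand-in for schema validation: every id must parse as an integer."""
    for row in batch:
        int(row["id"])  # raises ValueError on a bad value


source = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
rows_written = 0   # stand-in for rows appended to the single output file
batch_sizes = []

for batch in iter_csv_batches(source, batch_size=2):
    validate_batch(batch)
    rows_written += len(batch)  # a real pipeline would append the batch here
    batch_sizes.append(len(batch))
```

Only one batch is held in memory at a time, which is the point of the batched path; the final batch may be shorter than `batch_size`.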
The CLI shows a progress bar during large-file ingest, with row counts estimated from a file-size sample.
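One way to estimate row counts from a file-size sample is to measure the average line length in the first few kilobytes and divide the total file size by it. The sketch below assumes that approach; `estimate_row_count` and its `sample_bytes` parameter are hypothetical, and Astro's actual estimator may differ.

```python
import os
import tempfile


def estimate_row_count(path: str, sample_bytes: int = 64 * 1024) -> int:
    """Estimate data rows from the average line length in a leading sample."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    newline_count = sample.count(b"\n")
    if newline_count == 0:
        return 0 if file_size == 0 else 1
    # Use only the bytes up to the last complete line in the sample.
    complete_bytes = sample.rfind(b"\n") + 1
    avg_line_bytes = complete_bytes / newline_count
    estimated_lines = round(file_size / avg_line_bytes)
    return max(estimated_lines - 1, 0)  # subtract the header line


# Demo: a header plus 1000 fixed-width rows, sampled from the first 100 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"id,name\n")
    for i in range(1000):
        tmp.write(f"{i:03d},abc\n".encode())
    demo_path = tmp.name

estimate = estimate_row_count(demo_path, sample_bytes=100)
os.remove(demo_path)
```

The estimate is only as good as the sample: files whose row width varies a lot between the start and the rest will produce a progress total that drifts from the true count.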
Filter and quarantine
Filter steps and quarantine appends use batched Parquet I/O above the threshold, avoiding full in-memory loads.
Bypass batched ingest temporarily
To force the eager path for a one-off run (fast but memory-heavy), raise the threshold above your file size:
large_file_threshold_bytes = 10 * 1024 * 1024 * 1024 # 10GB
Next steps
Ingest: ingest behaviour
Steps and file I/O: scan(), sink(), and iter_batches()