# Large files

Astro uses batched I/O for files at or above `large_file_threshold_bytes` (default **100MB**). This applies to ingest, filter steps, and quarantine appends.

## Pipeline tuning

Configure on your pipeline class:

```python
class ExamplePipeline(Pipeline):
    large_file_threshold_bytes = 100 * 1024 * 1024  # 100MB
    ingest_batch_size = 100_000  # CSV rows per ingest batch
    run_batch_size = 100_000  # Parquet rows per filter/batch iteration
```

| Setting | Default | Purpose |
|---------|---------|---------|
| `large_file_threshold_bytes` | 100MB (`100 * 1024 * 1024` bytes) | Files at or above this size take the batched paths |
| `ingest_batch_size` | 100,000 | CSV rows validated and written per batch during ingest |
| `run_batch_size` | 100,000 | Parquet rows processed per batch during filter and `iter_batches()` |

## Ingest

Large-file ingest reads CSVs in a single pass, validates each batch with Pandera, and appends to a single Parquet file via PyArrow. Small files use the eager path.

The reader depends on the file encoding:

- UTF-8 files: `scan_csv` + `collect_batches`
- Non-UTF-8 encodings (e.g. cp1252): PyArrow streaming CSV reader

The CLI shows a progress bar during large-file ingest, with row counts estimated from a file-size sample.

## Filter and quarantine

Filter steps and quarantine appends use batched Parquet I/O for files at or above the threshold, avoiding full in-memory loads.

## Step authors

For large files in custom steps, prefer lazy I/O:

```python
import polars as pl


def step_transform(_ctx: StepContext, files: list[AstroFile]) -> None:
    file = files[0]
    if file.is_large_file():
        # Lazy path: build the query and let Astro write it without a full load.
        lazy_frame = file.scan().with_columns(pl.lit("processed").alias("stage"))
        file.save_in_place_lazy(lazy_frame)
        return
    # Eager path: small files are loaded fully into memory.
    file.save_in_place(file.load().with_columns(pl.lit("processed").alias("stage")))
```

Or iterate batches:

```python
for batch in file.iter_batches():
    # process batch
    ...
```

## Bypass batched ingest temporarily

To force the eager path for a one-off run (fast but memory-heavy), raise the threshold above your file size:

```python
large_file_threshold_bytes = 10 * 1024 * 1024 * 1024  # 10GB
```

## Next steps

- {doc}`ingest`: ingest behaviour
- {doc}`steps-and-files`: `scan()`, `sink()`, and `iter_batches()`
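
## Illustration: estimating rows from a file-size sample

The Ingest section mentions that the progress bar's row counts are estimated from a file-size sample. The sketch below shows one way such an estimate could work; it is not Astro's implementation. The function name `estimate_row_count` and the `sample_bytes` parameter are hypothetical. It assumes a CSV with one header line and no newlines embedded in quoted fields, and it extrapolates from the average line length in the first chunk of the file.

```python
import os


def estimate_row_count(path: str, sample_bytes: int = 1024 * 1024) -> int:
    """Estimate the number of data rows in a CSV from a leading byte sample.

    Hypothetical helper for illustration only; assumes a single header line
    and no newlines inside quoted fields.
    """
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 0

    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)

    newlines = sample.count(b"\n")

    # The sample covered the whole file, so the line count is exact.
    if len(sample) == file_size:
        lines = newlines + (0 if sample.endswith(b"\n") else 1)
        return max(lines - 1, 0)  # subtract the header line

    # No complete line in the sample: there is no basis for an estimate.
    if newlines == 0:
        return 0

    # Average bytes per complete line, ignoring the trailing partial line.
    complete_bytes = sample.rfind(b"\n") + 1
    avg_line_bytes = complete_bytes / newlines

    # Extrapolate to the full file size, then subtract the header line.
    return max(round(file_size / avg_line_bytes) - 1, 0)
```

The estimate is only as good as the sample: files whose row widths drift over their length will be over- or under-counted, which is acceptable for a progress bar but not for validation.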