# Ingest

`astro ingest SOURCE_DIR` creates a new pipeline run under `.working/{run_id}/`. `SOURCE_DIR` must be a directory containing one or more CSV source files.

## Run directory after ingest

```text
.working/
  abcde/
    manifest.json
    ingested/
      establishments.parquet
    astro.log
```

## Ingest behaviour

1. Validate `SOURCE_DIR` contains exactly the expected CSV files (no extras, no subdirectories)
2. Validate each CSV against its Pandera schema
3. Write Parquet files to `.working/{run_id}/ingested/`
4. Record run and file statistics in `.astro/stats.db`
5. Update `manifest.json` with status `ingested`

## IngestFileSpec

Declare expected source files in `ingest_files`:

```python
IngestFileSpec(
    name="establishments",
    source_pattern="edubase*.csv",
    schema=pa.DataFrameSchema(
        {"URN": pa.Column(str), "EstablishmentName": pa.Column(str)},
        strict="filter",
    ),
    encoding="utf-8",    # optional, default utf-8
    has_header=True,     # optional, default True
    column_names=None,   # required when has_header=False
)
```

CSV dtypes are derived from the Pandera schema to avoid loading all columns as strings.

`name` values must be unique across the pipeline. Names must start with an alphanumeric character and may contain letters, numbers, `.`, `_`, and `-`.

Set `column_names` when `has_header=False`; Astro uses those names with the Pandera schema when reading headerless CSVs.

## Execution modes

| Mode | Behaviour |
|------|-----------|
| `serial` | Fail ingest if any run under `.working/` is not `completed` |
| `parallel` | Allow multiple incomplete runs concurrently |

Set on your pipeline class:

```python
class ExamplePipeline(Pipeline):
    execution_mode = ExecutionMode.SERIAL
```

Run IDs are 5-character lowercase alphanumeric strings.

## Large-file ingest

Files at or above `large_file_threshold_bytes` (default 100MB) use batched validation and append to Parquet via PyArrow. Small files use the eager path. See {doc}`large-files` for tuning batch sizes and streaming behaviour.
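
The threshold rule above can be sketched in plain Python. This is not Astro's implementation: `choose_ingest_path` and `iter_batches` are hypothetical names, the constant assumes that 100MB means 100 × 1024 × 1024 bytes, and the standard-library `csv` module stands in for the Pandera validation and PyArrow Parquet append that would happen per batch.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO

# Assumption: "100MB" is read here as 100 * 1024 * 1024 bytes. The actual
# default of large_file_threshold_bytes may be defined differently.
LARGE_FILE_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_ingest_path(size_bytes: int, threshold: int = LARGE_FILE_THRESHOLD_BYTES) -> str:
    """Return "batched" for files at or above the threshold, else "eager"."""
    return "batched" if size_bytes >= threshold else "eager"


def iter_batches(handle: TextIO, batch_size: int) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most batch_size rows from a CSV that has a header row."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        # A real batched path would validate and append this batch here.
        yield batch


# Tiny in-memory CSV used only to illustrate the batching behaviour.
sample = io.StringIO(
    "URN,EstablishmentName\n"
    "100000,Alpha School\n"
    "100001,Beta School\n"
    "100002,Gamma School\n"
)
batches = list(iter_batches(sample, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
```

Because each batch is consumed before the next one is read, memory use in a loop like this depends on the batch size rather than the size of the source file.
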

During large-file ingest, the CLI shows a progress bar with estimated row counts.

## Logging

`astro ingest` prints logs to the console and writes them to `.working/{run_id}/astro.log`. Each session starts with a timestamp separator:

```text
================================================================================
Astro session started: 2026-05-22T14:30:00.123456+00:00 command=ingest run_id=abc12
================================================================================
```

## Next steps

- {doc}`running`: execute pipeline steps after ingest
- {doc}`../getting-started/quickstart`: end-to-end walkthrough