Ingest¶
astro ingest SOURCE_DIR creates a new pipeline run under .working/{run_id}/. SOURCE_DIR must be a directory containing one or more CSV source files.
Run directory after ingest¶
.working/
  abcde/
    manifest.json
    ingested/
      establishments.parquet
    astro.log
Ingest behaviour¶
1. Validate that SOURCE_DIR contains exactly the expected CSV files (no extras, no subdirectories)
2. Validate each CSV against its Pandera schema
3. Write Parquet files to .working/{run_id}/ingested/
4. Record run and file statistics in .astro/stats.db
5. Update manifest.json with status ingested
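The directory check that opens the ingest steps can be sketched with the standard library alone. This is an illustration, not Astro's implementation: the function name check_source_dir, the patterns mapping (spec name to glob pattern), and the rule that each pattern must match exactly one file are all assumptions made for the example.

```python
from fnmatch import fnmatch
from pathlib import Path


def check_source_dir(source_dir: Path, patterns: dict[str, str]) -> dict[str, Path]:
    """Map each spec name to the single file matching its pattern.

    Hypothetical helper: raises ValueError on subdirectories, on files that
    no pattern claims, and on patterns matching zero or several files.
    """
    entries = sorted(source_dir.iterdir())

    # Subdirectories are rejected outright.
    subdirs = [p.name for p in entries if p.is_dir()]
    if subdirs:
        raise ValueError(f"unexpected subdirectories: {subdirs}")

    matched: dict[str, Path] = {}
    claimed: set[Path] = set()
    for name, pattern in patterns.items():
        hits = [p for p in entries if fnmatch(p.name, pattern)]
        # Assumption: one source file per declared pattern.
        if len(hits) != 1:
            raise ValueError(
                f"{name}: expected 1 file matching {pattern!r}, found {len(hits)}"
            )
        matched[name] = hits[0]
        claimed.add(hits[0])

    # Anything left over is an extra file.
    extras = [p.name for p in entries if p not in claimed]
    if extras:
        raise ValueError(f"unexpected files: {extras}")
    return matched
```

Called as check_source_dir(Path("data"), {"establishments": "edubase*.csv"}), the sketch returns the matched path or raises before any CSV is read, which keeps directory problems separate from schema errors.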
IngestFileSpec¶
Declare expected source files in ingest_files:
IngestFileSpec(
    name="establishments",
    source_pattern="edubase*.csv",
    schema=pa.DataFrameSchema(
        {"URN": pa.Column(str), "EstablishmentName": pa.Column(str)},
        strict="filter",
    ),
    encoding="utf-8",   # optional, default utf-8
    has_header=True,    # optional, default True
    column_names=None,  # required when has_header=False
)
CSV dtypes are derived from the Pandera schema to avoid loading all columns as strings.
name values must be unique across the pipeline. Names must start with an alphanumeric character and may contain letters, numbers, ., _, and -. Set column_names when has_header=False; Astro uses those names with the Pandera schema when reading headerless CSVs.
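The naming rules can be expressed as a small check. This is a sketch under assumptions: NAME_RE and validate_spec_names are hypothetical names, and "letters" and "numbers" are read as ASCII only, which the text above does not specify.

```python
import re

# Starts with an alphanumeric character, then letters, digits, ".", "_" or "-".
# Assumption: ASCII only.
NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_spec_names(names: list[str]) -> None:
    """Raise ValueError on the first malformed or duplicated name."""
    seen: set[str] = set()
    for name in names:
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid name: {name!r}")
        if name in seen:
            raise ValueError(f"duplicate name: {name!r}")
        seen.add(name)
```

Under these assumptions, "establishments" and "edubase-2026.v1" pass, while "_private" fails because it does not start with an alphanumeric character.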
Execution modes¶
| Mode | Behaviour |
|---|---|
| | Fail ingest if any run under |
| | Allow multiple incomplete runs concurrently |
Set on your pipeline class:
class ExamplePipeline(Pipeline):
    execution_mode = ExecutionMode.SERIAL
Run IDs are 5-character lowercase alphanumeric strings.
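For scripts that need to recognise or mock run IDs, the format can be sketched as below. How Astro actually generates IDs is not documented here; new_run_id and is_run_id are hypothetical helpers, and "alphanumeric" is taken to mean a-z plus 0-9.

```python
import secrets
import string

# Assumption: lowercase ASCII letters plus digits, 36 symbols in total.
RUN_ID_ALPHABET = string.ascii_lowercase + string.digits
RUN_ID_LENGTH = 5


def new_run_id() -> str:
    """Return a random ID in the documented 5-character format."""
    return "".join(secrets.choice(RUN_ID_ALPHABET) for _ in range(RUN_ID_LENGTH))


def is_run_id(value: str) -> bool:
    """Check that a string has the documented run ID shape."""
    return len(value) == RUN_ID_LENGTH and all(c in RUN_ID_ALPHABET for c in value)
```

With a 36-symbol alphabet there are 36^5 = 60,466,176 possible IDs, so a generator along these lines would still need a collision check against existing run directories.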
Large-file ingest¶
Files at or above large_file_threshold_bytes (default 100 MB) use batched validation and are appended to Parquet via PyArrow. Smaller files use the eager path. See Large files for tuning batch sizes and streaming behaviour.
During large-file ingest, the CLI shows a progress bar with estimated row counts.
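The choice between the two paths comes down to a size comparison, sketched below. The function name choose_ingest_path and the returned labels are hypothetical, and the default treats 100 MB as 100 * 1024 * 1024 bytes, which is an assumption since the exact byte count is not stated.

```python
from pathlib import Path

# Assumption: "100MB" is read as 100 MiB; the real default may be 100_000_000.
LARGE_FILE_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_ingest_path(
    csv_path: Path, threshold: int = LARGE_FILE_THRESHOLD_BYTES
) -> str:
    """Return "batched" for files at or above the threshold, else "eager"."""
    size = csv_path.stat().st_size
    # "At or above" means the comparison is inclusive.
    return "batched" if size >= threshold else "eager"
```

The inclusive comparison matters at the boundary: a file of exactly the threshold size takes the batched path.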
Logging¶
astro ingest prints logs to the console and writes them to .working/{run_id}/astro.log. Each session starts with a timestamp separator:
================================================================================
Astro session started: 2026-05-22T14:30:00.123456+00:00 command=ingest run_id=abc12
================================================================================
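When parsing astro.log, it can help to reproduce the separator shape. The sketch below builds a header in the same layout; session_header is a hypothetical helper, the 80-character rule width is an assumption, and Astro's own logging code may differ.

```python
from datetime import datetime, timezone
from typing import Optional

RULE = "=" * 80  # assumed width of the separator rule


def session_header(command: str, run_id: str, now: Optional[datetime] = None) -> str:
    """Build a three-line session separator with an ISO 8601 UTC timestamp."""
    started = (now or datetime.now(timezone.utc)).isoformat()
    line = f"Astro session started: {started} command={command} run_id={run_id}"
    return "\n".join([RULE, line, RULE])
```

Because the timestamp is timezone-aware, isoformat() includes the +00:00 offset seen in the example above.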
Next steps¶
Running pipelines — execute pipeline steps after ingest
Quickstart — end-to-end walkthrough