Introduction¶
Astro helps you build reliable CSV import pipelines in Python. You define a pipeline in an external repository, ingest source files from the command line, and run ordered processing steps with built-in validation, statistics, filtering, and quarantine support.
What Astro provides¶
CLI control: run and manage pipelines from the command line
Library: define pipelines by importing Astro in your own repository
External pipelines: each pipeline lives in its own repo with a pipeline.py file
Folder ingestion: ingest a source directory containing one or more CSV files with heterogeneous schemas
Persistent statistics: store pipeline run statistics locally in SQLite
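To make the persistent statistics idea concrete, here is a minimal sketch of recording per-step run statistics in a local SQLite database using Python's standard sqlite3 module. The run_stats table, its columns, and the record_run_stats and rows_dropped helpers are hypothetical names chosen for illustration; this introduction does not describe Astro's actual schema or database location.

```python
import sqlite3


def record_run_stats(conn, run_id, step, rows_in, rows_out):
    """Store statistics for one pipeline step (hypothetical schema, not Astro's)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_stats (
            run_id   TEXT NOT NULL,
            step     TEXT NOT NULL,
            rows_in  INTEGER NOT NULL,
            rows_out INTEGER NOT NULL,
            PRIMARY KEY (run_id, step)
        )
        """
    )
    # Re-recording the same (run_id, step) overwrites the earlier row.
    conn.execute(
        "INSERT OR REPLACE INTO run_stats VALUES (?, ?, ?, ?)",
        (run_id, step, rows_in, rows_out),
    )
    conn.commit()


def rows_dropped(conn, run_id):
    """Return the total number of rows removed across all steps of one run."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(rows_in - rows_out), 0) FROM run_stats WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return total


# An in-memory database stands in for an on-disk file in this example.
conn = sqlite3.connect(":memory:")
record_run_stats(conn, "run-001", "filter_invalid", 100, 92)
record_run_stats(conn, "run-001", "dedupe", 92, 90)
print(rows_dropped(conn, "run-001"))
```

Keying rows by run and step keeps statistics from different runs side by side, so earlier runs stay queryable after new ones complete.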
Typical workflow¶
Create a pipeline.py that declares ingest files, Pandera schemas, and run steps.
Run astro ingest path/to/source/ to validate CSVs and write Parquet snapshots.
Run astro run to execute registered pipeline steps.
Inspect logs, statistics, and quarantine files under .working/{run_id}/.
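For the final inspection step, a small script can locate the newest run directory and list what it contains. The sketch below uses only pathlib and assumes that each run is a subdirectory of .working/ and that modification time is a reasonable proxy for "latest"; the helper names latest_run_dir and list_run_files are hypothetical and not part of Astro.

```python
from pathlib import Path


def latest_run_dir(working_dir):
    """Return the most recently modified run directory, or None if there is none."""
    run_dirs = [p for p in Path(working_dir).iterdir() if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


def list_run_files(run_dir):
    """Return the sorted relative paths of every file under a run directory."""
    run_dir = Path(run_dir)
    return sorted(
        p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    working = Path(".working")
    run = latest_run_dir(working) if working.is_dir() else None
    if run is None:
        print("no runs found under .working/")
    else:
        print(f"latest run: {run.name}")
        for name in list_run_files(run):
            print(" ", name)
```

If run IDs turn out to sort chronologically, sorting by directory name would be a simpler alternative to comparing modification times.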
Tech stack¶
| Concern | Choice |
|---|---|
| Language | Python 3.11+ |
| DataFrame | Polars |
| Schema validation | Pydantic |
| Data validation | Pandera |
| CLI | Typer |
| Local storage | SQLite ( |
Next steps¶
Installation — set up Astro locally
Quickstart — walk through a complete ingest and run
Defining pipelines — learn the pipeline contract
Security model¶
Astro discovers and executes pipeline.py from the directory you pass to -C / --pipeline-dir (default: current directory). That module is arbitrary Python running as your user, with the same file and network access as any other Python process you start. Only run Astro against pipeline repositories you trust.
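The following sketch shows why this matters. It is not Astro's loader; it is a generic illustration, using the standard importlib machinery, of what loading a pipeline.py from a directory involves. The load_pipeline_module helper and the STEPS variable are hypothetical. The point is that importing the file runs its top-level code immediately, before any pipeline step is invoked.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_pipeline_module(pipeline_dir):
    """Import pipeline.py from a directory; its top-level code runs right away."""
    path = Path(pipeline_dir) / "pipeline.py"
    spec = importlib.util.spec_from_file_location("pipeline", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes whatever the file contains
    return module


# Demonstration: a pipeline.py that writes a file as a side effect of being imported.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "side_effect.txt"
    (Path(tmp) / "pipeline.py").write_text(
        "from pathlib import Path\n"
        f"Path({str(marker)!r}).write_text('written at import time')\n"
        "STEPS = []  # hypothetical pipeline declaration\n"
    )
    module = load_pipeline_module(tmp)
    side_effect_happened = marker.exists()
    print(side_effect_happened)
```

Here the side effect is a harmless file write, but the same mechanism would let an untrusted pipeline.py read credentials, delete files, or open network connections.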