# Statistics

Pipeline runs record numeric statistics scoped to a **run**, **file**, or **step**, keyed by an **action** name.

Each update:

1. Logs an INFO line to `astro.stats` (visible in the console, the dashboard, and `astro.log`)
2. Upserts the value in SQLite (later calls replace the prior value for the same scope/subject/action)

Log format:

```text
STAT run=abc12 scope=step subject=validate action=rows_quarantined value=3
```

Run-scoped statistics use `subject=-` in log output.
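
Because the log line is a flat sequence of `key=value` pairs, it can be parsed back into fields when scanning `astro.log`. The following is a minimal sketch, not part of `astro`: `STAT_PATTERN` and `parse_stat_line` are hypothetical names, and the pattern is inferred only from the sample line above, so it assumes that no field contains whitespace and that values are plain numbers.

```python
import re
from typing import Optional

# Pattern inferred from the sample STAT line; the real format may allow
# characters this sketch does not handle (for example, spaces in a subject).
STAT_PATTERN = re.compile(
    r"STAT run=(?P<run>\S+) scope=(?P<scope>\S+) "
    r"subject=(?P<subject>\S+) action=(?P<action>\S+) value=(?P<value>\S+)"
)


def parse_stat_line(line: str) -> Optional[dict]:
    """Return the fields of a STAT log line, or None if the line does not match."""
    match = STAT_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Run-scoped statistics log "-" as the subject; map it to None here.
    if fields["subject"] == "-":
        fields["subject"] = None
    # Assumption: values are numeric and can be read as floats.
    fields["value"] = float(fields["value"])
    return fields


sample = "STAT run=abc12 scope=step subject=validate action=rows_quarantined value=3"
parsed = parse_stat_line(sample)
print(parsed)
```

The pattern uses `search` rather than `match`, so it tolerates whatever timestamp or level prefix the log handler puts in front of `STAT`.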

## Step API

Each `StepContext` exposes a `stats` recorder:

```python
def step_transform(ctx: StepContext, files: list[AstroFile]) -> None:
    file = files[0]
    dataframe = file.load()
    ctx.stats.record_file(file.spec.__class__.ingest_name, "rows_read", dataframe.height)
    ctx.stats.record_run("custom_counter", 1)
    ctx.stats.record_step("rows_written", dataframe.height)
    file.save_in_place(dataframe)
```

| Method | Scope |
|--------|-------|
| `ctx.stats.record_run(action, value)` | Run |
| `ctx.stats.record_file(file_name, action, value)` | File (ingest name) |
| `ctx.stats.record_step(action, value)` | Current step |
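
When unit-testing a step function, it can help to replace `ctx.stats` with an in-memory object that mirrors the three methods in the table. The class below is a hypothetical stand-in, not something `astro` provides: `FakeStats`, its `step_name` field, and its `values` dictionary are illustrative names. It assumes only the method signatures listed above, and that step-scoped statistics use the step name as their subject, as the sample log line suggests.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStats:
    """Hypothetical in-memory stand-in for ctx.stats (not part of astro)."""

    step_name: str = "step"
    # Keyed by (scope, subject, action); a later call overwrites an earlier
    # one, mirroring the replace-on-update behaviour described above.
    values: dict = field(default_factory=dict)

    def record_run(self, action, value):
        self.values[("run", None, action)] = value

    def record_file(self, file_name, action, value):
        self.values[("file", file_name, action)] = value

    def record_step(self, action, value):
        self.values[("step", self.step_name, action)] = value


stats = FakeStats(step_name="transform")
stats.record_file("orders", "rows_read", 10)
stats.record_file("orders", "rows_read", 12)  # replaces the earlier value
stats.record_step("rows_written", 12)
stats.record_run("custom_counter", 1)
print(stats.values)
```

A test can then assert on `stats.values` after calling the step function, without touching the log or the database.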

`StatisticsRecorder` and `StatScope` are also exported from `astro` for use outside step functions.

## Built-in statistics

| Phase | Scope | Action | When recorded |
|-------|-------|--------|---------------|
| Ingest | file | `row_count`, `column_count`, `source_size_bytes` | After each file materializes |
| Ingest | run | `files_ingested` | After successful ingest |
| Ingest | run | `ingest_failed` | On ingest failure |
| Run | step | `duration_ms` | After each step executes |
| Run | step | `rows_quarantined` | When a step quarantines rows |
| Run | file | `rows_filtered`, `rows_kept` | After a filter step processes a file |
| Run | step | `rows_filtered` | After a filter step (total rows removed across all files) |
| Run | run | `steps_completed` | When run finishes |
| Run | run | `duration_ms` | When run finishes |
| Run | run | `steps_quarantined` | When run finishes with quarantined steps |

## SQLite storage

Statistics are stored in `{pipeline_dir}/.astro/stats.db`:

- **`runs`**: `run_id`, `pipeline_name`, `status`, `source_directory`, `created_at`, `ingested_at`
- **`ingest_files`**: per-file row/column counts, source path, parquet path, source size
- **`statistics`**: generic metrics with `run_id`, `scope`, `subject`, `action`, `value`, `recorded_at`
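
The upsert behaviour and a typical read of the `statistics` table can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions, not the actual schema: the column types, the `UNIQUE` constraint, and the `UPSERT` statement are guesses based on the column names listed above, and the example runs against an in-memory database rather than `stats.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: only the column names come from the documentation;
# the types and the uniqueness constraint are illustrative guesses.
conn.execute(
    """
    CREATE TABLE statistics (
        run_id TEXT NOT NULL,
        scope TEXT NOT NULL,
        subject TEXT NOT NULL,
        action TEXT NOT NULL,
        value REAL NOT NULL,
        recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (run_id, scope, subject, action)
    )
    """
)

# Insert a new row, or replace the value if the same key already exists.
# The ON CONFLICT clause requires SQLite 3.24 or newer.
UPSERT = """
    INSERT INTO statistics (run_id, scope, subject, action, value)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (run_id, scope, subject, action)
    DO UPDATE SET value = excluded.value, recorded_at = CURRENT_TIMESTAMP
"""

conn.execute(UPSERT, ("abc12", "step", "validate", "rows_quarantined", 3))
conn.execute(UPSERT, ("abc12", "step", "validate", "rows_quarantined", 5))
conn.execute(UPSERT, ("abc12", "step", "validate", "duration_ms", 120))
conn.commit()

# Read back every step-scoped statistic for one run.
rows = conn.execute(
    "SELECT action, value FROM statistics "
    "WHERE run_id = ? AND scope = ? ORDER BY action",
    ("abc12", "step"),
).fetchall()
print(rows)
```

The second `rows_quarantined` call overwrites the first, so the query returns one row per action. The same `SELECT` could be pointed at the real database file once the actual schema has been checked.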

See {doc}`working-directory` for the full directory layout.