Canonical ID resolver
Astro provides a Polars-native library for mapping source entries to stable canonical UUIDs and detecting grouped field changes.
Storage
- Named stores live at {pipeline_dir}/.persistent/{name}.parquet.
- Each store is scoped to a resolver instance (for example establishments, links).
- Writes use an atomic temp-file rename.
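The atomic temp-file rename mentioned above can be sketched with the standard library alone. This is a minimal illustration, not Astro's actual implementation: the helper name `atomic_write_bytes` is hypothetical, and a real store would write Parquet data rather than a literal byte payload.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload to target so readers never observe a partial file.

    Hypothetical helper for illustration; not part of the astro API.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the final rename stays
    # on one filesystem, which is what lets os.replace swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before renaming
        os.replace(tmp_name, target)  # swap the finished file into place
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With Polars, the same shape would apply by calling `DataFrame.write_parquet` on the temporary path and then renaming it over the store file. A crash mid-write then leaves the previous store intact.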
Usage
from datetime import date
from pathlib import Path

import polars as pl

from astro import CanonicalIdResolver

resolver = CanonicalIdResolver(
    pipeline_dir=Path("/path/to/pipeline"),
    name="establishments",
    hash_groups={
        "entry_changed": "*all",
        "address_changed": ["address1", "address2", "postcode"],
        "owner_changed": ["trust (code)"],
    },
)

# df is a pl.DataFrame of source rows that includes the source_key column.
result = resolver.resolve(
    data=df,
    source_key_column="source_key",
    namespace="establishments",
    run_date=date.today(),
)
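Inputs
| Parameter | Purpose |
|---|---|
| source_key_column | Column holding the pipeline-provided identifier within a source file |
| namespace | Prefix applied to the stored key |
| hash_groups | Dict mapping group names to the fields hashed for that group ("*all" or a list of column names, as in the usage example) |
| run_date | Date-only value used for change tracking (no time component) |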
Hash groups use SHA-256 over canonicalized field values (null/blank normalized, \x1f separator).
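The hashing step described above might look like the following sketch. Only SHA-256 and the `\x1f` separator come from the text. The exact canonicalization rules are assumptions: here `None` and whitespace-only strings both collapse to an empty token, other values are stringified and stripped, and fields are hashed in the order given. The names `canonicalize` and `group_hash` are hypothetical.

```python
import hashlib
from typing import Mapping, Optional, Sequence

FIELD_SEPARATOR = "\x1f"  # ASCII unit separator, as named in the text


def canonicalize(value: Optional[object]) -> str:
    """Map None and blank strings to the same empty token (assumed rule)."""
    if value is None:
        return ""
    return str(value).strip()


def group_hash(row: Mapping[str, object], fields: Sequence[str]) -> str:
    """Return the SHA-256 hex digest over the canonicalized field values."""
    joined = FIELD_SEPARATOR.join(canonicalize(row.get(f)) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Usage: a null and a blank address2 hash identically under these rules.
fields = ["address1", "address2", "postcode"]
row_null = {"address1": "1 High St", "address2": None, "postcode": "AB1 2CD"}
row_blank = {"address1": "1 High St", "address2": "   ", "postcode": "AB1 2CD"}
same = group_hash(row_null, fields) == group_hash(row_blank, fields)
```

The separator matters because plain concatenation would make the pairs ("ab", "c") and ("a", "bc") indistinguishable. A control character that rarely appears in field data keeps the field boundaries unambiguous.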
Outputs
Each row is augmented with:
| Column | Meaning |
|---|---|
| canonical_id | Stable UUID v4 string |
| | |
| {group} | Boolean flag per hash group (for example address_changed) |
Persistent record
Each stored entry retains:
- source_key: namespaced key
- canonical_id
- {group}_hash columns for each configured hash group
- last_changed_date: date of the most recent hash change
- update_dates: list of dates the entry was created or changed
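To make the roles of these fields concrete, here is a small in-memory sketch of how an existing entry could be updated on a run. It is not the library's code: `StoredEntry` and `apply_run` are hypothetical names, the `hashes` dict stands in for the `{group}_hash` columns, and the handling of brand-new entries is deliberately left out because the text does not specify it.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class StoredEntry:
    """Illustrative mirror of one persisted row (hypothetical class)."""

    source_key: str
    canonical_id: str
    hashes: Dict[str, str]  # group name -> hash, standing in for {group}_hash
    last_changed_date: Optional[date] = None
    update_dates: List[date] = field(default_factory=list)


def apply_run(
    entry: StoredEntry, new_hashes: Dict[str, str], run_date: date
) -> Dict[str, bool]:
    """Compare hashes, update the entry in place, return one flag per group."""
    flags = {g: entry.hashes.get(g) != h for g, h in new_hashes.items()}
    if any(flags.values()):
        entry.hashes.update(new_hashes)
        entry.last_changed_date = run_date
        # Record each change date once, even if the same run is repeated.
        if run_date not in entry.update_dates:
            entry.update_dates.append(run_date)
    return flags
```

An unchanged run leaves `last_changed_date` and `update_dates` untouched, so the record only grows when a hash actually differs.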
Performance
Resolution is vectorized with Polars joins and expressions. UUID assignment loops only over new keys. The design targets batches up to 75K rows against stores up to 250K entries.
See the API reference for full details.