Parquet & Postgres
Two of the four direct-ingestion backends. ParquetSource reads hive-partitioned parquet in place from local disk or object storage; PostgresSource reads a production PostgreSQL table or query read-only via a READ_ONLY DuckDB attach. Both live in src/daedalus/catalog/table.py, and both are direct sources — Daedalus reads them where they are. Neither is "downloaded" first.
ParquetSource
ParquetSource is a first-class direct source, not a staging area you copy into: point a feature view at an S3 / R2 / local hive-partitioned parquet root and the pipeline reads it directly. It is selected whenever the source path is set, and resolves to a recursive, hive-partitioned, union-by-name parquet read:
read_parquet('<path>/**/*.parquet', hive_partitioning=true, union_by_name=true)
A path that already ends in .parquet is read as a single file; otherwise it is treated as a directory and globbed recursively.
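The path rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code in table.py: the helper name parquet_sql_expr is hypothetical, and it assumes the single-file case keeps the same read options and that single quotes in the path are escaped.

```python
def parquet_sql_expr(path: str) -> str:
    """Build a DuckDB read_parquet(...) expression for a source path.

    Hypothetical helper for illustration; the real logic lives in
    src/daedalus/catalog/table.py and may differ in detail.
    """
    # Escape single quotes so the path is safe inside a SQL string literal
    # (an assumption; the source does not describe quoting).
    quoted = path.replace("'", "''")
    if path.endswith(".parquet"):
        # A path that already ends in .parquet is read as a single file.
        target = quoted
    else:
        # Anything else is treated as a directory and globbed recursively.
        target = quoted.rstrip("/") + "/**/*.parquet"
    # Assumption: the single-file case keeps the same read options.
    return (
        f"read_parquet('{target}', "
        "hive_partitioning=true, union_by_name=true)"
    )
```

For example, parquet_sql_expr("data/mewtant/feed") yields the recursive glob form shown above, while a path such as "data/mewtant/feed/part-0.parquet" is passed through without a glob.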
Source paths are directories
By project convention, source paths are directories (recursive globs with hive partitioning) — not file enumerations. Point path at the partition root, not at individual files.
Local, S3, and R2
Local paths need no credentials. For an s3:// or r2:// path, set the s3_* fields so setup_conn creates the scoped DuckDB S3 secret needed to read the remote files:
INSTALL httpfs; LOAD httpfs;
CREATE OR REPLACE SECRET __daeda_s3_<digest> (
    TYPE s3,
    KEY_ID '...', SECRET '...',
    REGION '...',    -- emitted only when set
    ENDPOINT '...',  -- emitted only when set (e.g. Cloudflare R2)
    SCOPE 's3://bucket/prefix'
);
The secret is scoped to the parquet root so multiple object-store paths with different credentials don't clobber each other. Repo-relative paths are resolved under repo_root by build_source; absolute paths and remote URIs are used as-is.
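To make the scoping concrete, here is a rough sketch of how such a statement could be assembled. The function name s3_secret_sql and the digest scheme (first 12 hex characters of a SHA-256 over the scope) are assumptions for illustration; the source only says the secret name carries a digest and that REGION and ENDPOINT are emitted when set.

```python
import hashlib
from typing import Optional


def _lit(value: str) -> str:
    """Render a Python string as a single-quoted SQL literal."""
    return "'" + value.replace("'", "''") + "'"


def s3_secret_sql(
    scope: str,
    key_id: str,
    secret: str,
    region: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> str:
    """Build a scoped CREATE SECRET statement (hypothetical helper)."""
    # Assumption: derive the digest from the scope so each parquet root
    # gets its own secret name and different roots never overwrite
    # each other's credentials.
    digest = hashlib.sha256(scope.encode("utf-8")).hexdigest()[:12]
    parts = ["TYPE s3", f"KEY_ID {_lit(key_id)}", f"SECRET {_lit(secret)}"]
    if region:
        parts.append(f"REGION {_lit(region)}")      # only when set
    if endpoint:
        parts.append(f"ENDPOINT {_lit(endpoint)}")  # only when set (e.g. R2)
    parts.append(f"SCOPE {_lit(scope)}")
    body = ",\n    ".join(parts)
    return f"CREATE OR REPLACE SECRET __daeda_s3_{digest} (\n    {body}\n);"
```

Under this sketch, two roots such as s3://bucket-a/feed and s3://bucket-b/feed produce two differently named secrets, each limited to its own SCOPE.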
Feature-view YAML examples
Local directory:
# feature_views/feed_local.yaml
name: feed_events
entities:
  - user
  - artwork
source:
  name: feed_events_parquet
  path: "data/mewtant/feed"  # directory → recursive hive glob
  timestamp_field: event_timestamp
features:
  - name: event
    dtype: VARCHAR
S3 / R2 with scoped credentials:
source:
  name: feed_events_s3
  path: "s3://pixai-features/feed"
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"
  s3_region: "${S3_REGION}"
  # s3_endpoint: "${S3_ENDPOINT}"  # set for R2 / S3-compatible stores
  timestamp_field: event_timestamp
PostgresSource
PostgresSource is read-only: a feature view reads a production Postgres table or query in place, and Daedalus never writes back. It is selected when database_path is a postgres:// or postgresql:// URI. setup_conn loads the postgres extension and attaches the database READ_ONLY:
INSTALL postgres; LOAD postgres;
ATTACH IF NOT EXISTS '<uri>' AS <attach_name> (TYPE POSTGRES, READ_ONLY);
Reference either a table_name or a query:
- Table → sql_expr() is <attach_name>.<schema>.<table> (schema defaults to public).
- Query → sql_expr() is postgres_query('<attach_name>', '<query>'), pushing the query down to Postgres.
build_source requires either table_name or query.
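The selection and expression rules above can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: that a query takes precedence when both fields are set, and that single quotes in the query are doubled before it is embedded in postgres_query(...).

```python
from typing import Optional


def is_postgres_uri(database_path: str) -> bool:
    """True when database_path selects PostgresSource."""
    return database_path.startswith(("postgres://", "postgresql://"))


def postgres_sql_expr(
    attach_name: str,
    table_name: Optional[str] = None,
    query: Optional[str] = None,
    schema: str = "public",
) -> str:
    """Return the DuckDB expression for a table or pushed-down query.

    Hypothetical helper for illustration only.
    """
    if query:
        # Assumption: the query wins if both are given, and embedded
        # single quotes are escaped for the SQL string literal.
        escaped = query.replace("'", "''")
        return f"postgres_query('{attach_name}', '{escaped}')"
    if table_name:
        # Table reference: <attach_name>.<schema>.<table>
        return f"{attach_name}.{schema}.{table_name}"
    raise ValueError("either table_name or query is required")
```

With an attach name of pg (also hypothetical), a table_name of user_profile resolves to pg.public.user_profile, and a query is wrapped in postgres_query('pg', '...') so that Postgres executes it.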
Feature-view YAML examples
Table reference:
# feature_views/user_pg.yaml
name: user_profile
entities:
  - user
source:
  name: user_profile_postgres
  database_path: "postgresql://${PG_DSN}"  # full DSN via ${ENV}
  table_name: "user_profile"               # schema defaults to public
  timestamp_field: updated_at
features:
  - name: country
    dtype: VARCHAR
Pushed query:
source:
  name: user_profile_postgres_q
  database_path: "postgresql://${PG_DSN}"
  query: "SELECT user_id, country, tier FROM public.user_profile WHERE active"
  timestamp_field: updated_at
Secrets via ${ENV}, never inline
Inject the Postgres DSN and every s3_* credential as ${ENV} references — never paste a literal password or access key into the YAML. to_dict() masks the connection string (a postgres:// / postgresql:// URI is reduced to its scheme prefix + *** unless it is entirely a single ${VAR} reference) and the s3_* secrets, so catalog show / service show / lineage never leak them.
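The connection-string masking rule can be illustrated with a small sketch. This is not the to_dict() implementation: the function name mask_database_path and the exact pattern used to recognize a ${VAR} reference are assumptions.

```python
import re

# Assumption: a "single ${VAR} reference" is exactly ${NAME}, where NAME
# is a conventional environment-variable identifier.
_SINGLE_ENV_REF = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\}")
_PG_SCHEME = re.compile(r"(postgres(?:ql)?://)")


def mask_database_path(value: str) -> str:
    """Mask a Postgres URI for display (hypothetical helper)."""
    if _SINGLE_ENV_REF.fullmatch(value):
        # The whole value is one ${VAR} reference: nothing secret to hide.
        return value
    scheme = _PG_SCHEME.match(value)
    if scheme:
        # Keep only the scheme prefix and hide everything after it.
        return scheme.group(1) + "***"
    # Non-Postgres values are returned unchanged in this sketch.
    return value
```

Under these assumptions, "postgresql://${PG_DSN}" is displayed as "postgresql://***", while a bare "${PG_DSN}" is shown as-is because it contains no secret material.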
See also the Sources Overview, DuckLake, Snowflake, Configuration, and the Architecture Overview.