
Parquet & Postgres

ParquetSource and PostgresSource are two of the four direct-ingestion backends. ParquetSource reads hive-partitioned parquet in place from local disk or object storage; PostgresSource reads a production PostgreSQL table or query through a READ_ONLY DuckDB attach. Both live in src/daedalus/catalog/table.py, and both are direct sources: Daedalus reads the data where it lives, and nothing is downloaded or staged first.

ParquetSource

ParquetSource is a first-class direct source, not a staging area you copy into: point a feature view at an S3 / R2 / local hive-partitioned parquet root and the pipeline reads it directly. It is selected whenever the source path is set, and resolves to a recursive, hive-partitioned, union-by-name parquet read:

sql
read_parquet('<path>/**/*.parquet', hive_partitioning=true, union_by_name=true)

A path that already ends in .parquet is read as a single file; otherwise it is treated as a directory and globbed recursively.
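
The path rule above can be sketched in a few lines of Python. This is an illustration, not the code in table.py: the function names `parquet_target` and `parquet_read_expr` are hypothetical, and two details are assumptions (a trailing slash on a directory path is stripped, and the single-file form keeps the same read options as the glob form).

```python
def parquet_target(path: str) -> str:
    """Return the file or glob that read_parquet should scan (hypothetical helper)."""
    if path.endswith(".parquet"):
        # Already a single file: use it unchanged.
        return path
    # Otherwise treat the path as a directory and glob it recursively.
    # Assumption: a trailing slash is dropped before the glob is appended.
    return path.rstrip("/") + "/**/*.parquet"


def parquet_read_expr(path: str) -> str:
    """Wrap the target in the read_parquet(...) call shown above."""
    # Double any single quote so the path stays a valid SQL string literal.
    target = parquet_target(path).replace("'", "''")
    return f"read_parquet('{target}', hive_partitioning=true, union_by_name=true)"
```

For example, `parquet_target("data/mewtant/feed")` yields the recursive glob under that directory, while a path ending in .parquet passes through unchanged.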

Source paths are directories

By project convention, source paths are directories (recursive globs with hive partitioning) — not file enumerations. Point path at the partition root, not at individual files.

Local, S3, and R2

Local paths need no credentials. For an s3:// or r2:// path, set the s3_* fields so setup_conn creates the scoped DuckDB S3 secret needed to read the remote files:

sql
INSTALL httpfs; LOAD httpfs;
CREATE OR REPLACE SECRET __daeda_s3_<digest> (
  TYPE s3,
  KEY_ID '...', SECRET '...',
  REGION '...',      -- emitted only when set
  ENDPOINT '...',    -- emitted only when set (e.g. Cloudflare R2)
  SCOPE 's3://bucket/prefix'
);

The secret is scoped to the parquet root so multiple object-store paths with different credentials don't clobber each other. Repo-relative paths are resolved under repo_root by build_source; absolute paths and remote URIs are used as-is.
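
As a rough sketch of how such a statement could be assembled, the Python below emits REGION and ENDPOINT only when they are set and derives the secret name from the scope. The function name `s3_secret_sql` is hypothetical, and the digest scheme (first 12 hex characters of a SHA-256 of the scope) is an assumption; the source does not say how the digest is computed.

```python
import hashlib
from typing import Optional


def _quote(value: str) -> str:
    """Render a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_secret_sql(scope: str, key_id: str, secret: str,
                  region: Optional[str] = None,
                  endpoint: Optional[str] = None) -> str:
    """Build a scoped CREATE OR REPLACE SECRET statement (illustrative only)."""
    # Assumption: the name suffix is a short hash of the scope, so two
    # different parquet roots never share (and overwrite) one secret.
    digest = hashlib.sha256(scope.encode("utf-8")).hexdigest()[:12]
    parts = ["TYPE s3", f"KEY_ID {_quote(key_id)}", f"SECRET {_quote(secret)}"]
    if region:
        parts.append(f"REGION {_quote(region)}")      # emitted only when set
    if endpoint:
        parts.append(f"ENDPOINT {_quote(endpoint)}")  # emitted only when set
    parts.append(f"SCOPE {_quote(scope)}")
    body = ",\n  ".join(parts)
    return f"CREATE OR REPLACE SECRET __daeda_s3_{digest} (\n  {body}\n);"
```

Because the name depends on the scope, calling this for two different roots produces two distinct secrets, which is the property the scoping is meant to give.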

Feature-view YAML examples

Local directory:

yaml
# feature_views/feed_local.yaml
name: feed_events
entities:
  - user
  - artwork
source:
  name: feed_events_parquet
  path: "data/mewtant/feed"          # directory → recursive hive glob
  timestamp_field: event_timestamp
features:
  - name: event
    dtype: VARCHAR

S3 / R2 with scoped credentials:

yaml
source:
  name: feed_events_s3
  path: "s3://pixai-features/feed"
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"
  s3_region: "${S3_REGION}"
  # s3_endpoint: "${S3_ENDPOINT}"   # set for R2 / S3-compatible stores
  timestamp_field: event_timestamp

PostgresSource

PostgresSource is read-only: a feature view reads a production Postgres table or query in place, and Daedalus never writes back. It is selected when database_path is a postgres:// or postgresql:// URI. setup_conn loads the postgres extension and attaches the database READ_ONLY:

sql
INSTALL postgres; LOAD postgres;
ATTACH IF NOT EXISTS '<uri>' AS <attach_name> (TYPE POSTGRES, READ_ONLY);
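
The two selection rules on this page (path set means ParquetSource; a postgres:// or postgresql:// database_path means PostgresSource) could be expressed roughly as follows. The function `pick_backend` is hypothetical, the order in which the checks run is an assumption, and the other two backends are left out.

```python
from typing import Optional

POSTGRES_SCHEMES = ("postgres://", "postgresql://")


def pick_backend(source: dict) -> Optional[str]:
    """Map a feature view's `source` mapping to a backend name (illustrative)."""
    database_path = source.get("database_path") or ""
    # Assumption: the Postgres check runs first; the source does not state
    # which rule wins if both database_path and path are set.
    if database_path.startswith(POSTGRES_SCHEMES):
        return "PostgresSource"
    if source.get("path"):
        return "ParquetSource"
    # DuckLake and Snowflake are not covered by this sketch.
    return None
```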

Reference either a table_name or a query:

  • Table: sql_expr() is <attach_name>.<schema>.<table> (schema defaults to public).
  • Query: sql_expr() is postgres_query('<attach_name>', '<query>'), pushing the query down to Postgres.

build_source requires either table_name or query.
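
A minimal sketch of the two sql_expr() forms, written as a standalone function rather than the actual method. The name `postgres_sql_expr` is hypothetical, and two behaviours are assumptions: the query wins if both fields are given, and single quotes in the query are doubled so it survives as a SQL string literal.

```python
from typing import Optional


def postgres_sql_expr(attach_name: str,
                      table_name: Optional[str] = None,
                      query: Optional[str] = None,
                      schema: str = "public") -> str:
    """Return the DuckDB expression for a table or pushed-down query (illustrative)."""
    if query is not None:
        # Assumption: escape embedded single quotes by doubling them.
        escaped = query.replace("'", "''")
        return f"postgres_query('{attach_name}', '{escaped}')"
    if table_name is not None:
        # Three-part name: attached database, schema, table.
        return f"{attach_name}.{schema}.{table_name}"
    # Mirrors the rule that one of the two fields must be provided.
    raise ValueError("either table_name or query is required")
```

With an attach name of `pg`, a table reference resolves to `pg.public.user_profile`, while a query is wrapped in `postgres_query('pg', '...')` and executed by Postgres itself.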

Feature-view YAML examples

Table reference:

yaml
# feature_views/user_pg.yaml
name: user_profile
entities:
  - user
source:
  name: user_profile_postgres
  database_path: "postgresql://${PG_DSN}"   # full DSN via ${ENV}
  table_name: "user_profile"                 # schema defaults to public
  timestamp_field: updated_at
features:
  - name: country
    dtype: VARCHAR

Pushed query:

yaml
source:
  name: user_profile_postgres_q
  database_path: "postgresql://${PG_DSN}"
  query: "SELECT user_id, country, tier FROM public.user_profile WHERE active"
  timestamp_field: updated_at

Secrets via ${ENV}, never inline

Inject the Postgres DSN and every s3_* credential as ${ENV} references; never paste a literal password or access key into the YAML. to_dict() masks both the connection string and the s3_* secrets, so catalog show, service show, and lineage never leak them. A postgres:// or postgresql:// URI is reduced to its scheme prefix followed by ***, unless the value is entirely a single ${VAR} reference, which is shown unchanged.
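
The connection-string masking rule can be illustrated as below. This is a sketch of the described behaviour, not the to_dict() implementation: `mask_dsn` is a hypothetical name, and the pattern used to recognise a ${VAR} reference is an assumption.

```python
import re

# Assumption: a ${VAR} reference uses an identifier-style variable name.
_SINGLE_ENV_REF = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\}")
_PG_SCHEME = re.compile(r"^(postgres(?:ql)?://)")


def mask_dsn(value: str) -> str:
    """Mask a Postgres connection string for display (illustrative)."""
    if _SINGLE_ENV_REF.fullmatch(value):
        # The whole value is one ${VAR} reference: nothing secret to hide.
        return value
    match = _PG_SCHEME.match(value)
    if match:
        # Keep only the scheme prefix and replace the rest with ***.
        return match.group(1) + "***"
    return value
```

Under this reading, a literal URI with credentials and a URI that merely embeds a ${VAR} after the scheme are both reduced to the scheme plus ***; only a bare ${VAR} passes through.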

See also the Sources Overview, DuckLake, Snowflake, Configuration, and the Architecture Overview.