Overview
Daedalus ingests data directly from your sources, read in place; there is no download or copy step. A feature view declares where its data lives in the YAML source block, and the pipeline reads that backend directly through one universal engine: DuckDB. (The legacy daeda download command predates this model and is not needed under it; see the CLI reference.)
A source is a config-only choice of one of four backends:
| Backend | Access | Source type |
|---|---|---|
| Postgres | read-only | PostgresSource |
| S3 Hive-partitioned Parquet (local / s3:// / r2://) | read | ParquetSource |
| DuckLake tables | read (READ_ONLY attach) | DuckLakeSource |
| Snowflake | read-only (query pushdown) | SnowflakeSource |
Postgres and Snowflake are strictly read-only: Daedalus attaches Postgres with READ_ONLY and only ever runs a pushed query against Snowflake, so it never mutates either system. Daedalus dispatches to the right typed source at build time, with no Python and no per-backend code in the pipeline.
How dispatch works
The catalog parses each view's source into a DataSourceDef (src/daedalus/catalog/model.py). A consumer that needs to read the source calls build_source(DataSourceDef) (src/daedalus/catalog/source_factory.py), which returns a typed Source (src/daedalus/catalog/table.py). The consumer runs that Source's setup_conn() on its connection and splices sql_expr() into its own DuckDB query instead of hardcoding read_parquet.
Dispatch is driven entirely by the source fields:
| Condition | Backend | Source type |
|---|---|---|
| path is set | Parquet (local / s3:// / r2://) | ParquetSource |
| database_path starts with ducklake: | DuckLake (read, READ_ONLY) | DuckLakeSource |
| database_path starts with postgres:// or postgresql:// | PostgreSQL | PostgresSource |
| database_path starts with snowflake://<account> | Snowflake (optional extra) | SnowflakeSource |
path wins if set. Otherwise the scheme of database_path selects the backend. A database_path of :memory: (the default) with no path is not a usable source — build_source raises with guidance on what to set.
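The dispatch rules above can be sketched in a few lines of Python. This is an illustrative approximation, not the actual build_source implementation: SourceDefSketch and pick_backend are hypothetical names, only the two dispatch fields are modelled, and the ValueError is an assumption, since the text does not say which exception build_source raises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDefSketch:
    """Hypothetical stand-in for DataSourceDef: only the two dispatch fields."""
    path: Optional[str] = None
    database_path: str = ":memory:"  # the documented default


def pick_backend(src: SourceDefSketch) -> str:
    """Return the source type name that the dispatch table would select."""
    if src.path:  # path wins if set
        return "ParquetSource"
    db = src.database_path
    if db.startswith("ducklake:"):
        return "DuckLakeSource"
    if db.startswith(("postgres://", "postgresql://")):
        return "PostgresSource"
    if db.startswith("snowflake://"):
        return "SnowflakeSource"
    # :memory: with no path is not a usable source.
    # The exception type here is an assumption made for this sketch.
    raise ValueError(
        f"unusable source (database_path={db!r}): set path, or a "
        "ducklake:, postgres://, postgresql:// or snowflake:// database_path"
    )


print(pick_backend(SourceDefSketch(path="s3://bucket/events")))
print(pick_backend(SourceDefSketch(database_path="postgresql://host/db")))
```

Checking path before any scheme keeps the precedence rule in one place, so a view that sets both fields still resolves to Parquet.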
from pathlib import Path

import duckdb

from daedalus.catalog.source_factory import build_source, setup_source

# `view` is a feature view already parsed by the catalog.
src = build_source(view.source, repo_root=Path("/repo"))

# Splice the source into a DuckDB query:
conn = duckdb.connect()
setup_source(conn, view.source, src)  # pre_queries + ATTACH + S3 secret
rows = conn.sql(f"SELECT * FROM {src.sql_expr()}").to_arrow_table()

setup_source runs the source's pre_queries (each ${ENV}-expanded) first, then the source's own setup_conn (the READ_ONLY attach + S3 secret creation).
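The ordering that setup_source follows (expanded pre_queries first, then the source's own setup_conn) can be illustrated with a stand-in connection that records statements instead of executing them. Everything here is hypothetical: RecordingConn, AttachSketch, and setup_source_sketch are invented for the example, the signature differs from the real setup_source, and string.Template is used as a stand-in for expand_env.

```python
from string import Template


class RecordingConn:
    """Hypothetical stand-in for a DuckDB connection: records SQL instead of running it."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)
        return self


class AttachSketch:
    """Hypothetical source whose setup_conn issues a READ_ONLY attach."""

    def setup_conn(self, conn):
        conn.execute("ATTACH 'ducklake:meta.db' AS lake (READ_ONLY)")


def setup_source_sketch(conn, pre_queries, source, env):
    # 1. pre_queries run first, each with its ${VAR} references expanded.
    for query in pre_queries:
        # Template.substitute raises KeyError if a referenced variable is missing.
        conn.execute(Template(query).substitute(env))
    # 2. Then the source's own connection setup (attach, secrets).
    source.setup_conn(conn)


conn = RecordingConn()
setup_source_sketch(
    conn,
    ["SET s3_region = '${REGION}'"],
    AttachSketch(),
    {"REGION": "eu-west-1"},
)
for statement in conn.statements:
    print(statement)
```

Running pre_queries before the attach means any session settings they establish are already in place when the backend is attached.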
DuckDB is the universal engine
Whatever the backend, the source resolves to a DuckDB SQL expression (sql_expr()) and an attach recipe (setup_conn()):
- Parquet → read_parquet(...) over a recursive hive-partitioned glob.
- Postgres → ATTACH ... (TYPE POSTGRES, READ_ONLY), then a schema-qualified table reference or postgres_query(...).
- DuckLake → INSTALL/LOAD ducklake + backend ext, then ATTACH 'ducklake:...' AS <alias> (DATA_PATH ..., READ_ONLY).
- Snowflake → the pushed query runs in Snowflake; the Arrow result is registered as a local DuckDB view the pipeline reads.
Because every source ends up as a DuckDB relation, the rest of the pipeline stays backend-agnostic — operators read SELECT ... FROM <sql_expr> and never know (or care) which backend produced it.
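A minimal sketch of why this keeps operators backend-agnostic, assuming a simplified interface: the classes below are hypothetical and return setup SQL as strings (setup_statements) rather than taking a connection as the real setup_conn() does, so the example runs without DuckDB installed. The SQL shapes follow the list above; the alias, DSN, and table names are made up.

```python
from dataclasses import dataclass
from typing import List, Protocol


class SourceSketch(Protocol):
    """Hypothetical minimal interface: a relation expression plus setup SQL."""

    def sql_expr(self) -> str: ...
    def setup_statements(self) -> List[str]: ...


@dataclass
class ParquetSketch:
    root: str

    def sql_expr(self) -> str:
        # Recursive glob over a hive-partitioned directory tree.
        return f"read_parquet('{self.root}/**/*.parquet', hive_partitioning = true)"

    def setup_statements(self) -> List[str]:
        return []  # nothing to attach for plain files


@dataclass
class PostgresSketch:
    dsn: str
    table: str  # schema-qualified, e.g. public.orders
    alias: str = "pg"

    def sql_expr(self) -> str:
        return f"{self.alias}.{self.table}"

    def setup_statements(self) -> List[str]:
        return [f"ATTACH '{self.dsn}' AS {self.alias} (TYPE POSTGRES, READ_ONLY)"]


def count_query(source: SourceSketch) -> str:
    """An operator only sees sql_expr(); it never branches on the backend."""
    return f"SELECT count(*) FROM {source.sql_expr()}"


print(count_query(ParquetSketch("s3://bucket/events")))
print(count_query(PostgresSketch("dbname=app", "public.orders")))
```

The operator function is identical for both sources, which is the property the paragraph above describes.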
Secrets are ${ENV} references, never inlined
Every credential / connection field — database_path, s3_key_id, s3_secret, snowflake_private_key, and friends — is written in YAML as a ${VAR} reference and expanded at build time from os.environ by expand_env (source_factory.py). Secrets are never stored on the DataSourceDef and never appear in definitions.
Secrets via ${ENV}, never inline
Always inject credentials with ${ENV_VAR} placeholders. Never paste a literal password, key, or token into a feature-view YAML.
expand_env raises KeyError (fail fast, loudly) if a referenced variable is unset, so a misconfigured source never attaches with an empty credential.
to_dict() masks secrets via src/daedalus/catalog/redaction.py: a value that is entirely a single ${VAR} reference is shown verbatim (it carries no secret); anything mixing a recognised scheme with a literal credential is reduced to its scheme prefix + ***, and any other literal secret becomes ***. So catalog show / service show / lineage never leak credentials.
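The expansion and masking rules described above might look roughly like the following. This is a sketch under stated assumptions, not the code in source_factory.py or redaction.py: expand_env_sketch and mask_sketch are hypothetical names, the environment is passed in explicitly (a caller would pass os.environ), and the set of recognised scheme prefixes is a guess based on the backends this page lists.

```python
import re

# Matches a ${VAR} reference with a conventional environment variable name.
_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Assumed set of recognised schemes; the real list may differ.
_SCHEMES = ("postgresql://", "postgres://", "snowflake://", "ducklake:")


def expand_env_sketch(value, env):
    """Replace each ${VAR} with env[VAR]; raises KeyError if VAR is unset."""
    return _REF.sub(lambda match: env[match.group(1)], value)


def mask_sketch(value):
    """Mask a credential-bearing field for display."""
    if _REF.fullmatch(value):
        return value  # a lone ${VAR} reference carries no secret
    for prefix in _SCHEMES:
        if value.startswith(prefix):
            return prefix + "***"  # keep only the scheme prefix
    return "***"  # any other literal is fully masked


env = {"PG_USER": "reader", "PG_PASS": "s3cret"}
print(expand_env_sketch("postgres://${PG_USER}:${PG_PASS}@db/app", env))
print(mask_sketch("${PG_URL}"))
print(mask_sketch("postgres://reader:s3cret@db/app"))
```

Indexing the mapping directly, rather than using a default, is what makes an unset variable fail loudly instead of silently producing an empty credential.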
Per-backend pages
- DuckLake: DuckLake 1.0 ATTACH recipe, DuckLakeSource (read) and DuckLakeSink (write).
- Snowflake: optional extra, key-pair JWT auth, query pushdown to a local DuckDB view.
- Parquet & Postgres: hive-partitioned parquet globs (local / S3 / R2) and READ_ONLY Postgres attach.
See also Configuration for where source YAML lives and how it is layered, and the Architecture Overview for how direct ingestion fits into the one-pipeline operator DAG.