Overview

Daedalus ingests data directly from your sources, reading it in place: there is no download or copy step. A feature view declares where its data lives in its YAML source block, and the pipeline reads that backend directly through one universal engine, DuckDB. (The legacy daeda download command predates this model and is no longer needed; see the CLI reference.)

A source is a config-only choice of one of four backends:

| Backend | Access | Source type |
|---|---|---|
| Postgres | read-only | PostgresSource |
| S3 Hive-partitioned Parquet (local / s3:// / r2://) | read | ParquetSource |
| DuckLake tables | read (READ_ONLY attach) | DuckLakeSource |
| Snowflake | read-only (query pushdown) | SnowflakeSource |

Postgres and Snowflake are read-only: Daedalus attaches Postgres with READ_ONLY and only ever runs a pushed query against Snowflake, so it never mutates either system. Daedalus dispatches to the right typed source at build time, with no Python and no per-backend code in the pipeline.

How dispatch works

The catalog parses each view's source into a DataSourceDef (src/daedalus/catalog/model.py). When a consumer needs to read the source, it calls build_source(DataSourceDef) (src/daedalus/catalog/source_factory.py), which returns a typed Source (src/daedalus/catalog/table.py). The consumer prepares its connection with that source's setup_conn() and splices sql_expr() into its own DuckDB query, instead of hardcoding read_parquet.

Dispatch is driven entirely by the source fields:

| Condition | Backend | Source type |
|---|---|---|
| path is set | Parquet (local / s3:// / r2://) | ParquetSource |
| database_path starts with ducklake: | DuckLake (read, READ_ONLY) | DuckLakeSource |
| database_path starts with postgres:// or postgresql:// | PostgreSQL | PostgresSource |
| database_path starts with snowflake://<account> | Snowflake (optional extra) | SnowflakeSource |

path wins if set. Otherwise the scheme of database_path selects the backend. A database_path of :memory: (the default) with no path is not a usable source — build_source raises with guidance on what to set.
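The dispatch rules above can be sketched as a small pure function. This is an illustration only, not the real build_source: the name pick_backend is hypothetical, the use of ValueError is an assumption (the source only says build_source raises with guidance), and treating an unrecognised scheme as an error is likewise assumed.

```python
from typing import Optional


def pick_backend(path: Optional[str], database_path: str = ":memory:") -> str:
    """Return the source type name the dispatch rules would select.

    Hypothetical helper for illustration; not the real build_source.
    """
    # Rule 1: a set path always wins and means Parquet.
    if path:
        return "ParquetSource"
    # Rule 2: otherwise the scheme of database_path selects the backend.
    if database_path.startswith("ducklake:"):
        return "DuckLakeSource"
    if database_path.startswith(("postgres://", "postgresql://")):
        return "PostgresSource"
    if database_path.startswith("snowflake://"):
        return "SnowflakeSource"
    # Rule 3: the default ':memory:' with no path is not a usable source.
    # (Assumption: any other unrecognised value is rejected the same way.)
    raise ValueError(
        "no usable source: set `path`, or a database_path starting with "
        f"ducklake:, postgres://, postgresql:// or snowflake:// (got {database_path!r})"
    )
```

Because path is checked first, a view that sets both path and a postgres:// database_path still resolves to ParquetSource.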

```python
from pathlib import Path

import duckdb

from daedalus.catalog.source_factory import build_source, setup_source

# `view` is a feature view already loaded from the catalog;
# view.source is its DataSourceDef.
src = build_source(view.source, repo_root=Path("/repo"))

# Prepare the connection, then splice the source into a DuckDB query.
conn = duckdb.connect()
setup_source(conn, view.source, src)  # pre_queries + ATTACH + S3 secret
rows = conn.sql(f"SELECT * FROM {src.sql_expr()}").to_arrow_table()
```

setup_source runs the source's pre_queries (each ${ENV}-expanded) first, then the source's own setup_conn (the READ_ONLY attach + S3 secret creation).
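The ordering described above can be illustrated without a database by recording the statements that would be issued. Everything in this sketch is hypothetical: run_setup is a stand-in for setup_source, string.Template.substitute stands in for the real environment expansion, and the DEMO_REGION variable and the statements themselves are made up for the demonstration.

```python
import os
from string import Template
from typing import Callable, List


def run_setup(
    execute: Callable[[str], object],
    pre_queries: List[str],
    setup_conn: Callable[[], None],
) -> None:
    """Hypothetical stand-in for setup_source, showing only the ordering."""
    for query in pre_queries:
        # Stand-in for env expansion: substitute() raises KeyError
        # when a referenced variable is unset.
        execute(Template(query).substitute(os.environ))
    # Only after every pre-query has run does the source's own setup happen.
    setup_conn()


# Record what would be sent to the connection, in order.
os.environ["DEMO_REGION"] = "eu-west-1"
log: List[str] = []
run_setup(
    log.append,
    ["SET s3_region = '${DEMO_REGION}'"],
    lambda: log.append("ATTACH ... (READ_ONLY)"),
)
print(log)
```

The recorded list shows the expanded pre-query first and the attach step second, which is the property the real function is described as guaranteeing.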

DuckDB is the universal engine

Whatever the backend, the source resolves to a DuckDB SQL expression (sql_expr()) and an attach recipe (setup_conn()):

  • Parquet → read_parquet(...) over a recursive hive-partitioned glob.
  • Postgres → ATTACH ... (TYPE POSTGRES, READ_ONLY), then a schema-qualified table reference or postgres_query(...).
  • DuckLake → INSTALL/LOAD ducklake + backend ext, then ATTACH 'ducklake:...' AS <alias> (DATA_PATH ..., READ_ONLY).
  • Snowflake → the pushed query runs in Snowflake; the Arrow result is registered as a local DuckDB view the pipeline reads.

Because every source ends up as a DuckDB relation, the rest of the pipeline stays backend-agnostic — operators read SELECT ... FROM <sql_expr> and never know (or care) which backend produced it.
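To make the backend-agnostic idea concrete, here is a minimal sketch of two toy sources that share one interface. The class names, constructor fields, and the setup_conn(conn) signature are assumptions for illustration and are not taken from src/daedalus/catalog/table.py; a recording connection replaces DuckDB so the sketch runs anywhere.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Conn(Protocol):
    def execute(self, sql: str) -> object: ...


class Source(Protocol):
    """Assumed shape: a SQL expression plus a connection setup step."""
    def sql_expr(self) -> str: ...
    def setup_conn(self, conn: Conn) -> None: ...


@dataclass
class ToyParquetSource:
    glob: str

    def sql_expr(self) -> str:
        return f"read_parquet('{self.glob}', hive_partitioning = true)"

    def setup_conn(self, conn: Conn) -> None:
        pass  # a local glob needs no attach


@dataclass
class ToyPostgresSource:
    dsn: str
    table: str  # schema-qualified, e.g. "public.events"
    alias: str = "pg"

    def sql_expr(self) -> str:
        return f"{self.alias}.{self.table}"

    def setup_conn(self, conn: Conn) -> None:
        conn.execute(f"ATTACH '{self.dsn}' AS {self.alias} (TYPE POSTGRES, READ_ONLY)")


@dataclass
class RecordingConn:
    """Fake connection that only records the statements it receives."""
    statements: List[str] = field(default_factory=list)

    def execute(self, sql: str) -> None:
        self.statements.append(sql)


def count_rows_sql(source: Source) -> str:
    # The operator only ever sees sql_expr(), never the backend behind it.
    return f"SELECT count(*) FROM {source.sql_expr()}"
```

count_rows_sql is written once and works for either toy source, which mirrors how operators can stay unaware of the backend that produced a relation.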

Secrets are ${ENV} references, never inlined

Every credential / connection field — database_path, s3_key_id, s3_secret, snowflake_private_key, and friends — is written in YAML as a ${VAR} reference and expanded at build time from os.environ by expand_env (source_factory.py). Secrets are never stored on the DataSourceDef and never appear in definitions.

Secrets via ${ENV}, never inline

Always inject credentials with ${ENV_VAR} placeholders. Never paste a literal password, key, or token into a feature-view YAML.

  • expand_env raises KeyError (fail fast, loudly) if a referenced variable is unset — a misconfigured source never attaches with an empty credential.
  • to_dict() masks secrets via src/daedalus/catalog/redaction.py: a value that is entirely a single ${VAR} reference is shown verbatim (it carries no secret); anything mixing a recognised scheme with a literal credential is reduced to its scheme prefix + ***, and any other literal secret becomes ***. So catalog show / service show / lineage never leak credentials.
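The two behaviours in the list above, fail-fast expansion and masking, can be sketched as follows. These are not the real expand_env or redaction.py: the function names carry a _sketch suffix to mark them as hypothetical, and the tuple of recognised schemes is an assumption based on the backends this page mentions.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Assumption: which scheme prefixes count as "recognised".
_SCHEMES = ("postgresql://", "postgres://", "snowflake://", "ducklake:")


def expand_env_sketch(value: str) -> str:
    """Replace each ${VAR} with os.environ[VAR]; KeyError if VAR is unset."""
    return _VAR.sub(lambda m: os.environ[m.group(1)], value)


def mask_sketch(value: str) -> str:
    """Mask a credential-bearing field for display."""
    # A value that is exactly one ${VAR} reference carries no secret.
    if _VAR.fullmatch(value):
        return value
    # A recognised scheme with a literal credential keeps only its prefix.
    for scheme in _SCHEMES:
        if value.startswith(scheme):
            return scheme + "***"
    # Any other literal is fully masked.
    return "***"
```

Under these assumptions, mask_sketch("${PG_URL}") returns the reference unchanged, mask_sketch("postgresql://user:pw@host/db") returns "postgresql://***", and expand_env_sketch raises KeyError for any reference whose variable is missing from the environment.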

Per-backend pages

  • DuckLake — DuckLake 1.0 ATTACH recipe, DuckLakeSource (read) and DuckLakeSink (write).
  • Snowflake — optional extra, key-pair JWT auth, query pushdown to a local DuckDB view.
  • Parquet & Postgres — hive-partitioned parquet globs (local / S3 / R2) and READ_ONLY Postgres attach.

See also Configuration for where source YAML lives and how it is layered, and the Architecture Overview for how direct ingestion fits into the one-pipeline operator DAG.