DuckLake
DuckLake is Daedalus's lakehouse backend: a relational catalog (Postgres / SQLite / a DuckDB file) over parquet data files in object storage. A feature view points its source at a DuckLake table and Daedalus reads it in place — there is no download or copy step. Both the read side (DuckLakeSource) and the write side (DuckLakeSink) follow the same DuckLake 1.0 ATTACH recipe, defined in src/daedalus/catalog/table.py and src/daedalus/catalog/sink.py.
The DuckLake 1.0 ATTACH recipe
Both source and sink share the catalog-backend loader (_load_ducklake_backend) and the attach-URI normaliser (_ducklake_attach_uri). The recipe is:
Install / load the extensions — always
ducklake, plus the catalog-backend extension inferred from the scheme right afterducklake::sqlINSTALL ducklake; LOAD ducklake; INSTALL postgres; LOAD postgres; -- when ducklake:postgres:... -- or: INSTALL sqlite; LOAD sqlite; -- when ducklake:sqlite:... -- a bare ducklake:<file> (DuckDB-file catalog) needs no extra extensionThe backend is matched case-insensitively, so
postgres/postgresql/ mixed-case spellings all load the right extension.Create a scoped S3 secret — only when
data_path(the lake's parquetDATA_PATH) is ans3:///r2://URI and an access key is set:sqlINSTALL httpfs; LOAD httpfs; CREATE OR REPLACE SECRET __daeda_s3_<digest> ( TYPE s3, KEY_ID '...', SECRET '...', REGION '...', -- emitted only when set ENDPOINT '...', -- emitted only when set (e.g. Cloudflare R2) SCOPE 's3://bucket/lake/data' );The secret is named and scoped to the
DATA_PATHso a service reading two object-store paths with different credentials gets one secret per path (DuckDB resolves each read to the longest-matching scope) instead of a single unscopedCREATE OR REPLACEclobbering the others.ATTACH the catalog — pointing at the data path; READ_ONLY for reads:
sqlATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias> (DATA_PATH 's3://bucket/lake/data', READ_ONLY);Tables are referenced schema-qualified through the attach alias:
<alias>.main.<table>(sotableshould include the schema, e.g.main.dwd_…).
Catalog backends
The scheme right after ducklake: selects the catalog store:
database_path form | Catalog backend |
|---|---|
ducklake:postgres:<libpq> | PostgreSQL (loads postgres) |
ducklake:sqlite:<path> | SQLite (loads sqlite) |
ducklake:<file> | a DuckDB-file catalog (no extra extension) |
A bare value without the ducklake: prefix is prefixed automatically by _ducklake_attach_uri.
DuckLakeSource (read)
DuckLakeSource attaches READ_ONLY by default (read_only=True) — a read-only attach never mutates the production catalog. build_source always constructs DuckLake sources with read_only=True. Reference either a schema-qualified table or an inline query; the SQL expression is <alias>.<table> (or (<query>)).
Feature-view YAML example
# feature_views/artwork_properties.yaml
name: artwork_properties
entities:
- artwork
source:
name: artwork_properties_ducklake
database_path: "ducklake:postgres:${DUCKLAKE_DSN}" # libpq DSN via ${ENV}
table_name: "main.dwd_artwork_properties" # schema-qualified
data_path: "s3://pixai-lake/artwork/data" # the DATA_PATH
catalog_alias: "lake" # optional explicit alias
s3_key_id: "${S3_KEY_ID}"
s3_secret: "${S3_SECRET}"
s3_region: "${S3_REGION}"
# s3_endpoint: "${S3_ENDPOINT}" # set for R2 / S3-compatible stores
timestamp_field: event_timestamp
features:
- name: author_id
dtype: BIGINT
- name: model_id
dtype: VARCHARYou can also push a query instead of a table:
source:
name: artwork_recent_ducklake
database_path: "ducklake:postgres:${DUCKLAKE_DSN}"
query: "SELECT artwork_id, author_id FROM lake.main.dwd_artwork_properties"
data_path: "s3://pixai-lake/artwork/data"
catalog_alias: "lake"
s3_key_id: "${S3_KEY_ID}"
s3_secret: "${S3_SECRET}"DuckLakeSource is DuckDB-only for SQL; Polars access goes through DuckDB's .pl() bridge → .lazy().
DuckLakeSink (write)
DuckLakeSink upserts results into a DuckLake table via MERGE INTO. It uses a writable attach (no READ_ONLY), so after the ATTACH it persists the production parquet write options on the catalog with set_option:
ATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias>
(DATA_PATH 's3://bucket/lake/data');
CALL <alias>.set_option('parquet_compression', 'lz4');
CALL <alias>.set_option('parquet_version', 2);
CALL <alias>.set_option('data_inlining_row_limit', 100);The merge matches on key_columns — UPDATE BY NAME on match, INSERT BY NAME on no match. DuckLake has no primary keys, so pass the natural key:
MERGE INTO <alias>.<table> AS target
USING __merge_source AS source
ON target.<key> = source.<key>
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;Matching by name makes the merge robust to column ordering. The write side creates the same scoped S3 secret as the read side when data_path is remote.
Secrets via ${ENV}, never inline
Inject the catalog DSN and every s3_* credential as ${ENV} references — never paste a literal libpq password or access key into the YAML. A database_path that mixes a literal credential with templated fields (e.g. ducklake:postgres:host=${PGHOST} password=hunter2) is masked to its scheme prefix + *** by to_dict(), but the safe practice is a fully-templated DSN so nothing literal is ever present to leak.
See also the Sources Overview, Parquet & Postgres, Snowflake, Configuration, and the Architecture Overview.