DuckLake

DuckLake is Daedalus's lakehouse backend: a relational catalog (Postgres / SQLite / a DuckDB file) over parquet data files in object storage. A feature view points its source at a DuckLake table and Daedalus reads it in place — there is no download or copy step. Both the read side (DuckLakeSource) and the write side (DuckLakeSink) follow the same DuckLake 1.0 ATTACH recipe, defined in src/daedalus/catalog/table.py and src/daedalus/catalog/sink.py.

The DuckLake 1.0 ATTACH recipe

Both source and sink share the catalog-backend loader (_load_ducklake_backend) and the attach-URI normaliser (_ducklake_attach_uri). The recipe is:

Install / load the extensions — always ducklake, plus the catalog-backend extension inferred from the scheme right after ducklake::
sql
```
INSTALL ducklake; LOAD ducklake;
INSTALL postgres;  LOAD postgres;   -- when ducklake:postgres:...
-- or: INSTALL sqlite; LOAD sqlite; -- when ducklake:sqlite:...
-- a bare ducklake:<file> (DuckDB-file catalog) needs no extra extension
```
The backend is matched case-insensitively, so postgres / postgresql / mixed-case spellings all load the right extension.
Create a scoped S3 secret — only when data_path (the lake's parquet DATA_PATH) is an s3:// / r2:// URI and an access key is set:
sql
```
INSTALL httpfs; LOAD httpfs;
CREATE OR REPLACE SECRET __daeda_s3_<digest> (
  TYPE s3,
  KEY_ID '...', SECRET '...',
  REGION '...',           -- emitted only when set
  ENDPOINT '...',         -- emitted only when set (e.g. Cloudflare R2)
  SCOPE 's3://bucket/lake/data'
);
```
The secret is named and scoped to the DATA_PATH so a service reading two object-store paths with different credentials gets one secret per path (DuckDB resolves each read to the longest-matching scope) instead of a single unscoped CREATE OR REPLACE clobbering the others.
ATTACH the catalog — pointing at the data path; READ_ONLY for reads:
sql
```
ATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias>
  (DATA_PATH 's3://bucket/lake/data', READ_ONLY);
```
Tables are referenced schema-qualified through the attach alias: <alias>.main.<table> (so table should include the schema, e.g. main.dwd_…).

Catalog backends

The scheme right after ducklake: selects the catalog store:

`database_path` form	Catalog backend
`ducklake:postgres:<libpq>`	PostgreSQL (loads `postgres`)
`ducklake:sqlite:<path>`	SQLite (loads `sqlite`)
`ducklake:<file>`	a DuckDB-file catalog (no extra extension)

A bare value without the ducklake: prefix is prefixed automatically by _ducklake_attach_uri.

DuckLakeSource (read)

DuckLakeSource attaches READ_ONLY by default (read_only=True) — a read-only attach never mutates the production catalog. build_source always constructs DuckLake sources with read_only=True. Reference either a schema-qualified table or an inline query; the SQL expression is <alias>.<table> (or (<query>)).

Feature-view YAML example

yaml

# feature_views/artwork_properties.yaml
name: artwork_properties
entities:
  - artwork
source:
  name: artwork_properties_ducklake
  database_path: "ducklake:postgres:${DUCKLAKE_DSN}"   # libpq DSN via ${ENV}
  table_name: "main.dwd_artwork_properties"            # schema-qualified
  data_path: "s3://pixai-lake/artwork/data"            # the DATA_PATH
  catalog_alias: "lake"                                 # optional explicit alias
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"
  s3_region: "${S3_REGION}"
  # s3_endpoint: "${S3_ENDPOINT}"   # set for R2 / S3-compatible stores
  timestamp_field: event_timestamp
features:
  - name: author_id
    dtype: BIGINT
  - name: model_id
    dtype: VARCHAR

You can also push a query instead of a table:

yaml

source:
  name: artwork_recent_ducklake
  database_path: "ducklake:postgres:${DUCKLAKE_DSN}"
  query: "SELECT artwork_id, author_id FROM lake.main.dwd_artwork_properties"
  data_path: "s3://pixai-lake/artwork/data"
  catalog_alias: "lake"
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"

DuckLakeSource is DuckDB-only for SQL; Polars access goes through DuckDB's .pl() bridge → .lazy().

DuckLakeSink (write)

DuckLakeSink upserts results into a DuckLake table via MERGE INTO. It uses a writable attach (no READ_ONLY), so after the ATTACH it persists the production parquet write options on the catalog with set_option:

sql

ATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias>
  (DATA_PATH 's3://bucket/lake/data');

CALL <alias>.set_option('parquet_compression', 'lz4');
CALL <alias>.set_option('parquet_version', 2);
CALL <alias>.set_option('data_inlining_row_limit', 100);

The merge matches on key_columns — UPDATE BY NAME on match, INSERT BY NAME on no match. DuckLake has no primary keys, so pass the natural key:

sql

MERGE INTO <alias>.<table> AS target
USING __merge_source AS source
ON target.<key> = source.<key>
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;

Matching by name makes the merge robust to column ordering. The write side creates the same scoped S3 secret as the read side when data_path is remote.

Secrets via ${ENV}, never inline

Inject the catalog DSN and every s3_* credential as ${ENV} references — never paste a literal libpq password or access key into the YAML. A database_path that mixes a literal credential with templated fields (e.g. ducklake:postgres:host=${PGHOST} password=hunter2) is masked to its scheme prefix + *** by to_dict(), but the safe practice is a fully-templated DSN so nothing literal is ever present to leak.

DuckLake ​

The DuckLake 1.0 ATTACH recipe ​

Catalog backends ​

DuckLakeSource (read) ​

Feature-view YAML example ​