Skip to content

DuckLake

DuckLake is Daedalus's lakehouse backend: a relational catalog (Postgres / SQLite / a DuckDB file) over parquet data files in object storage. A feature view points its source at a DuckLake table and Daedalus reads it in place — there is no download or copy step. Both the read side (DuckLakeSource) and the write side (DuckLakeSink) follow the same DuckLake 1.0 ATTACH recipe, defined in src/daedalus/catalog/table.py and src/daedalus/catalog/sink.py.

The DuckLake 1.0 ATTACH recipe

Both source and sink share the catalog-backend loader (_load_ducklake_backend) and the attach-URI normaliser (_ducklake_attach_uri). The recipe is:

  1. Install / load the extensions — always ducklake, plus the catalog-backend extension inferred from the scheme right after ducklake::

    sql
    INSTALL ducklake; LOAD ducklake;
    INSTALL postgres;  LOAD postgres;   -- when ducklake:postgres:...
    -- or: INSTALL sqlite; LOAD sqlite; -- when ducklake:sqlite:...
    -- a bare ducklake:<file> (DuckDB-file catalog) needs no extra extension

    The backend is matched case-insensitively, so postgres / postgresql / mixed-case spellings all load the right extension.

  2. Create a scoped S3 secret — only when data_path (the lake's parquet DATA_PATH) is an s3:// / r2:// URI and an access key is set:

    sql
    INSTALL httpfs; LOAD httpfs;
    CREATE OR REPLACE SECRET __daeda_s3_<digest> (
      TYPE s3,
      KEY_ID '...', SECRET '...',
      REGION '...',           -- emitted only when set
      ENDPOINT '...',         -- emitted only when set (e.g. Cloudflare R2)
      SCOPE 's3://bucket/lake/data'
    );

    The secret is named and scoped to the DATA_PATH so a service reading two object-store paths with different credentials gets one secret per path (DuckDB resolves each read to the longest-matching scope) instead of a single unscoped CREATE OR REPLACE clobbering the others.

  3. ATTACH the catalog — pointing at the data path; READ_ONLY for reads:

    sql
    ATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias>
      (DATA_PATH 's3://bucket/lake/data', READ_ONLY);

    Tables are referenced schema-qualified through the attach alias: <alias>.main.<table> (so table should include the schema, e.g. main.dwd_…).

Catalog backends

The scheme right after ducklake: selects the catalog store:

database_path formCatalog backend
ducklake:postgres:<libpq>PostgreSQL (loads postgres)
ducklake:sqlite:<path>SQLite (loads sqlite)
ducklake:<file>a DuckDB-file catalog (no extra extension)

A bare value without the ducklake: prefix is prefixed automatically by _ducklake_attach_uri.

DuckLakeSource (read)

DuckLakeSource attaches READ_ONLY by default (read_only=True) — a read-only attach never mutates the production catalog. build_source always constructs DuckLake sources with read_only=True. Reference either a schema-qualified table or an inline query; the SQL expression is <alias>.<table> (or (<query>)).

Feature-view YAML example

yaml
# feature_views/artwork_properties.yaml
name: artwork_properties
entities:
  - artwork
source:
  name: artwork_properties_ducklake
  database_path: "ducklake:postgres:${DUCKLAKE_DSN}"   # libpq DSN via ${ENV}
  table_name: "main.dwd_artwork_properties"            # schema-qualified
  data_path: "s3://pixai-lake/artwork/data"            # the DATA_PATH
  catalog_alias: "lake"                                 # optional explicit alias
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"
  s3_region: "${S3_REGION}"
  # s3_endpoint: "${S3_ENDPOINT}"   # set for R2 / S3-compatible stores
  timestamp_field: event_timestamp
features:
  - name: author_id
    dtype: BIGINT
  - name: model_id
    dtype: VARCHAR

You can also push a query instead of a table:

yaml
source:
  name: artwork_recent_ducklake
  database_path: "ducklake:postgres:${DUCKLAKE_DSN}"
  query: "SELECT artwork_id, author_id FROM lake.main.dwd_artwork_properties"
  data_path: "s3://pixai-lake/artwork/data"
  catalog_alias: "lake"
  s3_key_id: "${S3_KEY_ID}"
  s3_secret: "${S3_SECRET}"

DuckLakeSource is DuckDB-only for SQL; Polars access goes through DuckDB's .pl() bridge → .lazy().

DuckLakeSink (write)

DuckLakeSink upserts results into a DuckLake table via MERGE INTO. It uses a writable attach (no READ_ONLY), so after the ATTACH it persists the production parquet write options on the catalog with set_option:

sql
ATTACH IF NOT EXISTS 'ducklake:postgres:<libpq>' AS <alias>
  (DATA_PATH 's3://bucket/lake/data');

CALL <alias>.set_option('parquet_compression', 'lz4');
CALL <alias>.set_option('parquet_version', 2);
CALL <alias>.set_option('data_inlining_row_limit', 100);

The merge matches on key_columnsUPDATE BY NAME on match, INSERT BY NAME on no match. DuckLake has no primary keys, so pass the natural key:

sql
MERGE INTO <alias>.<table> AS target
USING __merge_source AS source
ON target.<key> = source.<key>
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;

Matching by name makes the merge robust to column ordering. The write side creates the same scoped S3 secret as the read side when data_path is remote.

Secrets via ${ENV}, never inline

Inject the catalog DSN and every s3_* credential as ${ENV} references — never paste a literal libpq password or access key into the YAML. A database_path that mixes a literal credential with templated fields (e.g. ducklake:postgres:host=${PGHOST} password=hunter2) is masked to its scheme prefix + *** by to_dict(), but the safe practice is a fully-templated DSN so nothing literal is ever present to leak.

See also the Sources Overview, Parquet & Postgres, Snowflake, Configuration, and the Architecture Overview.