Configuration

Daedalus has two kinds of configuration:

Pipeline configuration — layered YAML files under config/ that define feature dimensions, the months to process, which views to load, the output schema, and the rolling-aggregation specs. These drive every training run.
Agent CLI configuration — a small TOML file (~/.daedalus/config.toml) that holds operator-facing defaults for the daeda CLI and platform surfaces.

The pipeline configuration is loaded through Pydantic Settings with YAML sources (SettingsConfigDict(yaml_file=...)), so the YAML files are the source of truth.

The layered pipeline config

Pipeline configuration is split across three files, each with a clear job:

| File | Owns |
| --- | --- |
| `config/base.yaml` | Feature dimensions, feature lists, the default feature S3 root |
| `config/training/runtime.yaml` | Date ranges, chunking, `feature_refs`, `output_columns`, the compute engine, day parallelism |
| `config/training/aggregations.yaml` | `aggregation_specs`, `enrichment_specs`, `embedding_stores` |

`config/base.yaml` — project-level constants

Feature dimensions, the canonical feature lists, and a default feature S3 root:

```yaml
max_like_artworks_len: 32       # max user like-interaction length
artwork_embed_dim: 1152         # artwork embedding dimension
max_followed_users_len: 32      # max followed-users length

# Default feature S3 root (a convenience prefix for parquet sources)
feature_s3_root_dir: "s3://pixai-rec-sys/infra"
```

It also defines the click_event_features, artwork_features, and user_features column lists used across the pipeline.

`config/training/runtime.yaml` — the run knobs

This is the file you tune per run. The key fields:

```yaml
# Convenience root for repo-relative parquet source paths. Data is read in
# place from each feature view's configured source (Postgres / S3 parquet /
# DuckLake / Snowflake); this is not a download/staging directory. A local
# parquet dir (e.g. feed/, user/, ...) is simply one direct source.
data_root: data/mewtant

# Inclusive YYYYMM month range to process (the pipeline iterates every
# month from start through end)
feed_start: "202602"
feed_end: "202605"

# Rows per output chunk (CLI-compat; the staged SQL skeleton does not
# chunk in Python)
entity_spine_chunk_rows: 500000

# Target days processed concurrently. Keep at 1 unless the host has the
# CPU/memory/spill headroom for N independent daily pipelines
day_workers: 1

# Rolling-aggregation engine
compute:
    engine: duckdb                # `duckdb` (default) or `polars`
    duckdb_config:
        threads: "4"
        memory_limit: "32GiB"
        temp_directory: "/data/jiacheng/cache/duckdb"
        preserve_insertion_order: "false"

# Output directory for per-day parquets
output_root: data/training_output
```
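To make the inclusive `feed_start` / `feed_end` range concrete, here is a minimal sketch of how a YYYYMM range might be expanded into individual months. The helper name `iter_months` is hypothetical and is not taken from the Daedalus codebase; the pipeline's own iteration may differ in detail.

```python
def iter_months(start: str, end: str) -> list:
    """Return every YYYYMM month from start through end, inclusive.

    Hypothetical helper for illustration; not part of the daeda CLI.
    """
    # Convert YYYYMM into a running month index so year rollover is trivial.
    start_index = int(start[:4]) * 12 + int(start[4:]) - 1
    end_index = int(end[:4]) * 12 + int(end[4:]) - 1
    if end_index < start_index:
        raise ValueError(f"feed_end {end} is before feed_start {start}")
    return [
        f"{index // 12:04d}{index % 12 + 1:02d}"
        for index in range(start_index, end_index + 1)
    ]


months = iter_months("202602", "202605")
```

With the values shown above, the range covers four months, February through May 2026, with both endpoints included.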

`feature_refs` — which views to load

Each entry has the form view_name:column_name. The entry only determines which views are pulled in: every column defined in a referenced view is loaded, not just the one named.

```yaml
feature_refs:
    - user_profile:followed_user_ids
    - artwork_properties:author_id
    - artwork_generation:model_id
    - artwork_aesthetic:aes_score
    - artwork_tack:tack_ids
    - artwork_vector:media_safety_score
```
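Because only the view part of each entry affects what is loaded, the set of views can be derived by splitting on the colon. The following is a rough sketch under that reading; `views_to_load` is a hypothetical name, and the real loader may validate entries differently.

```python
def views_to_load(feature_refs: list) -> list:
    """Return the distinct view names referenced, in first-seen order.

    Hypothetical helper for illustration only.
    """
    views = []
    for ref in feature_refs:
        view, sep, column = ref.partition(":")
        if not sep or not view or not column:
            raise ValueError(f"expected view_name:column_name, got {ref!r}")
        if view not in views:
            views.append(view)
    return views


refs = [
    "user_profile:followed_user_ids",
    "artwork_properties:author_id",
    "artwork_properties:some_other_column",  # made-up column, same view
]
views = views_to_load(refs)
```

Two entries that name different columns of the same view resolve to a single view, which matches the rule that all of a view's columns are loaded regardless of which one is named.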

`output_columns` — the output schema

Columns written to the per-day parquets. Entries given as a {name, default} mapping are filled with that literal when the column is absent — so the schema stays consistent even before the enrichment operators run. Dynamic embedding columns use default_sql to emit a typed NULL placeholder:

```yaml
output_columns:
    - name: user_id
      default: 0
    - name: artwork_id
      default: 0
    - event_timestamp          # bare entry: no default
    - event
    # ... rolling aggregations, user/artwork features ...
    - name: image_embedding
      default_sql: "CAST(NULL AS FLOAT[])"
    - name: like_artwork_avg_embeds
      default_sql: "CAST(NULL AS FLOAT[])"
```
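One way to picture the default-filling rule is as the construction of a SQL select list: columns that exist pass through, and absent columns are replaced by their literal or by their `default_sql` expression. This is only a sketch of that idea. The function names are hypothetical, and raising an error for a missing bare entry is an assumption, since the page does not say what happens in that case.

```python
def sql_literal(value) -> str:
    """Render a Python scalar as a SQL literal (illustrative only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def select_list(output_columns: list, available: set) -> list:
    """Build select expressions, falling back to defaults for absent columns."""
    expressions = []
    for entry in output_columns:
        if isinstance(entry, str):  # bare entry: name only, no default
            entry = {"name": entry}
        name = entry["name"]
        if name in available:
            expressions.append(name)
        elif "default_sql" in entry:
            expressions.append(f"{entry['default_sql']} AS {name}")
        elif "default" in entry:
            expressions.append(f"{sql_literal(entry['default'])} AS {name}")
        else:
            # Assumption: a bare column with no default must be present.
            raise KeyError(f"column {name!r} is absent and has no default")
    return expressions


columns = [
    {"name": "user_id", "default": 0},
    "event",
    {"name": "image_embedding", "default_sql": "CAST(NULL AS FLOAT[])"},
]
before_enrichment = select_list(columns, available={"event"})
```

The typed `CAST(NULL AS FLOAT[])` placeholder keeps the embedding column's type stable in the output even when no enrichment operator has populated it yet.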

Engine choice

compute.engine selects the rolling-aggregation engine. The DuckDB and Polars implementations are bit-for-bit equivalent on every production spec, so the choice is about performance and A/B testing, not correctness. compute.duckdb_config is forwarded to duckdb.connect(config=...); pass explicit threads / memory_limit when the host's /proc reports more CPU or memory than the process's cgroup actually allows (common on shared devpods).
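When /proc is not trustworthy, one option is to derive `threads` and `memory_limit` from the cgroup v2 control files instead. The sketch below assumes the standard cgroup v2 formats of `cpu.max` (quota and period, or `max`) and `memory.max` (bytes, or `max`); the function names are hypothetical and nothing here is part of Daedalus itself. It parses file contents passed in as strings so that it can be tried without a cgroup mount.

```python
from typing import Optional


def threads_from_cpu_max(text: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max ('<quota> <period>' or 'max <period>')."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit set on this cgroup
    return max(1, int(quota) // int(period))


def duckdb_overrides(cpu_max: str, memory_max: str) -> dict:
    """Build string-valued entries suitable for compute.duckdb_config."""
    config = {}
    threads = threads_from_cpu_max(cpu_max)
    if threads is not None:
        config["threads"] = str(threads)
    limit = memory_max.strip()
    if limit != "max":
        # Express the byte limit in whole MiB.
        config["memory_limit"] = f"{int(limit) // 2**20}MiB"
    return config


# Example contents of /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max
overrides = duckdb_overrides("400000 100000", "34359738368")
```

In practice an operator would probably leave some headroom below the cgroup memory limit rather than pass it through unchanged, since the process needs memory outside DuckDB's buffer pool as well.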

`config/training/aggregations.yaml` — the aggregation specs

This file defines the rolling-aggregation, enrichment, and embedding-store specs.

The aggregation_specs list defines per-feature rolling aggregations. Each spec sets a name, group_by, value_column, operation (list / count / flatten_distinct), timestamp_column, include_current, window_days, max_items, an optional condition, and exclude_self:

```yaml
aggregation_specs:
    - name: like_artwork_ids_7d
      group_by: ["user_id"]
      value_column: artwork_id
      operation: list
      timestamp_column: event_timestamp
      include_current: false
      window_days: 7
      max_items: 32
      condition: "hist.event = 'like_click'"
      exclude_self: true
```
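As a mental model for what a `list` spec computes, here is a deliberately naive pure-Python reference. It is not how the DuckDB or Polars engines work, and several semantics are assumptions rather than documented behaviour: the window is taken as the `window_days` before the current row with the current timestamp excluded (`include_current: false`), `exclude_self` is read as dropping history values equal to the current row's value, and `max_items` keeps the most recent entries. The `condition` is a Python callable here in place of the SQL string.

```python
from datetime import datetime, timedelta


def rolling_list(rows, *, group_by, value_column, timestamp_column,
                 window_days, max_items, condition, exclude_self):
    """O(n^2) reference for a rolling `list` aggregation (illustrative)."""
    results = []
    for current in rows:
        now = current[timestamp_column]
        lower = now - timedelta(days=window_days)
        history = [
            hist for hist in rows
            if all(hist[key] == current[key] for key in group_by)
            and lower <= hist[timestamp_column] < now  # strictly before now
            and condition(hist)
            and not (exclude_self
                     and hist[value_column] == current[value_column])
        ]
        # Assumption: keep the most recent max_items values, newest first.
        history.sort(key=lambda hist: hist[timestamp_column], reverse=True)
        results.append([hist[value_column] for hist in history[:max_items]])
    return results


def row(user, artwork, event, day, hour=12):
    return {"user_id": user, "artwork_id": artwork, "event": event,
            "event_timestamp": datetime(2026, 2, day, hour)}


events = [
    row("u1", "a1", "like_click", 1),
    row("u1", "a2", "impression", 2),
    row("u1", "a3", "like_click", 3),
    row("u1", "a1", "like_click", 5),
    row("u1", "a4", "like_click", 11, hour=0),
    row("u2", "a9", "like_click", 4),
]
like_artwork_ids_7d = rolling_list(
    events, group_by=["user_id"], value_column="artwork_id",
    timestamp_column="event_timestamp", window_days=7, max_items=32,
    condition=lambda hist: hist["event"] == "like_click", exclude_self=True,
)
```

In this toy data the fourth row (a second like of `a1`) sees only `a3` in its history, because the earlier `a1` like is dropped by the assumed `exclude_self` rule and the `a2` impression fails the condition.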

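The enrichment_specs list defines dynamic features computed by the enrichment (pythonic/Ray) operators. Each spec names a source_id_column whose IDs are looked up in an embedding store (store_name) and the UDF (udf_name) that turns the result into the feature: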

```yaml
enrichment_specs:
    - name: like_artwork_avg_embeds
      source_id_column: like_artwork_ids_7d
      store_name: artwork_embeddings
      udf_name: avg_pool_embeddings_udf
    - name: image_embedding
      source_id_column: artwork_id
      store_name: artwork_embeddings
      udf_name: lookup_embedding_udf
```
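The two UDF names suggest a single-ID lookup and an average over a list of IDs. The sketch below illustrates those two operations against an in-memory dict standing in for the Lance store. The real `avg_pool_embeddings_udf` and `lookup_embedding_udf` are not shown on this page, so the function names, the handling of missing IDs, and the `None` result for empty input are all assumptions.

```python
from typing import Optional


def lookup_embedding(item_id, store: dict) -> Optional[list]:
    """Return the embedding for one ID, or None if the store lacks it."""
    return store.get(item_id)


def avg_pool_embeddings(item_ids: list, store: dict) -> Optional[list]:
    """Element-wise mean of the embeddings found for item_ids.

    Assumption: IDs missing from the store are skipped, and an empty
    result yields None (a NULL in the output column).
    """
    vectors = [store[item_id] for item_id in item_ids if item_id in store]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional store; the real store holds 1152-dimensional vectors.
toy_store = {"a1": [1.0, 3.0], "a2": [3.0, 5.0]}
pooled = avg_pool_embeddings(["a1", "a2", "not_in_store"], toy_store)
```

Here `like_artwork_avg_embeds` would correspond to pooling over the IDs in `like_artwork_ids_7d`, while `image_embedding` would correspond to a direct lookup on `artwork_id`.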

The embedding_stores list defines the Lance stores that enrichment specs refer to by store_name:

```yaml
embedding_stores:
    - name: artwork_embeddings
      type: lance
      path: data/store/v6/artwork_embedding
      id_column: id
      embedding_column: image_embedding
      embedding_dim: 1152
      embedding_dtype: float16
      build_batch_size: 50000
      source_parquet: data/mewtant/siglip2_vectors
```
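For sizing purposes, the store fields are enough to estimate the raw vector payload of one build batch. This is a back-of-the-envelope sketch that assumes `build_batch_size` counts rows per batch and ignores ID columns and any Lance format overhead; the helper name is hypothetical.

```python
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def batch_payload_bytes(embedding_dim: int, embedding_dtype: str,
                        build_batch_size: int) -> int:
    """Raw bytes of embedding data in one build batch (vectors only)."""
    return embedding_dim * BYTES_PER_ELEMENT[embedding_dtype] * build_batch_size


payload = batch_payload_bytes(1152, "float16", 50000)
per_vector = batch_payload_bytes(1152, "float16", 1)
```

With the values above, each vector is 2,304 bytes and a 50,000-row batch carries about 115 MB of embedding data, which is half of what the same batch would need at float32.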

Overriding config paths

daeda pipeline train resolves these files from their default locations but accepts overrides for the config paths, the output root, and day parallelism:

```bash
daeda pipeline train dssm_ranking \
    --runtime-config-path config/training/runtime.yaml \
    --aggregation-config-path config/training/aggregations.yaml \
    --output-root /path/to/output \
    --day-workers 4
```
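A common way to implement this kind of override is to start from the file values and let any flag that was actually passed win. The sketch below assumes that `--output-root` and `--day-workers` map onto the `output_root` and `day_workers` keys of runtime.yaml and that CLI values take precedence; both points are assumptions, as is the helper name.

```python
def effective_runtime(file_values: dict, cli_overrides: dict) -> dict:
    """Overlay CLI flags on file values; flags left unset (None) are ignored."""
    merged = dict(file_values)  # copy so the loaded config is not mutated
    merged.update(
        {key: value for key, value in cli_overrides.items() if value is not None}
    )
    return merged


from_file = {"output_root": "data/training_output", "day_workers": 1}
from_cli = {"output_root": "/path/to/output", "day_workers": 4, "feed_start": None}
runtime = effective_runtime(from_file, from_cli)
```

Treating `None` as "flag not given" keeps the file value in place for anything the operator did not override on the command line.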

Agent CLI config

The daeda config command group persists operator-facing CLI defaults to a TOML file. By default this lives at ~/.daedalus/config.toml; override the location with the $DAEDALUS_CONFIG environment variable.

```bash
daeda config init          # write the default config (use --force to overwrite)
daeda config show          # print the effective config (--format json for JSON)
daeda config set <key> <value>
daeda config path          # print the resolved config file path
```
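The path resolution described above (environment variable first, then the default under the home directory) can be sketched as follows. This only illustrates the stated rule; the function name is hypothetical, and treating an empty `DAEDALUS_CONFIG` as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(environ: Optional[Mapping] = None) -> Path:
    """Return $DAEDALUS_CONFIG if set, else ~/.daedalus/config.toml."""
    env = os.environ if environ is None else environ
    override = env.get("DAEDALUS_CONFIG")
    if override:  # assumption: an empty value counts as unset
        return Path(override).expanduser()
    return Path.home() / ".daedalus" / "config.toml"


default_path = resolve_config_path({})
custom_path = resolve_config_path({"DAEDALUS_CONFIG": "/tmp/alt-config.toml"})
```

Passing the environment in as a mapping makes the rule easy to exercise without modifying the real process environment.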

The supported keys (with their defaults) are:

| Key | Default |
| --- | --- |
| `dagster_graphql_url` | `http://localhost:3000/graphql` |
| `api_url` | `http://localhost:8000` |
| `feature_views_dir` | `feature_views` |
| `feature_services_dir` | `feature_services` |
| `default_service` | `dssm_ranking` |
| `output_root` | `data/training_output` |

For example, to point the CLI at a different feature-service directory:

```bash
daeda config set feature_services_dir /path/to/feature_services
```

Secrets

Never inline credentials in config or source YAML. Switching a data source between backends (parquet, DuckLake, Postgres, Snowflake) is a config-only change, and all credentials are supplied via ${ENV} references that are resolved from the environment rather than written into the YAML. See Data Sources Overview for the per-backend recipes.

WARNING

Credentials referenced as ${ENV} are read from the process environment at run time. Keep them out of the repo and out of ~/.daedalus/config.toml (which holds only non-secret operator defaults).
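To illustrate what resolving a `${ENV}` reference at run time can look like, here is a small sketch that substitutes `${NAME}` placeholders from an environment mapping and fails loudly when a variable is missing. The exact placeholder grammar Daedalus accepts is not specified on this page, so the regular expression, the function name, and the example variable `PG_PASSWORD` are assumptions.

```python
import os
import re
from typing import Mapping, Optional

# Assumed placeholder form: ${NAME} with a shell-style identifier.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value: str, environ: Optional[Mapping] = None) -> str:
    """Replace each ${NAME} in value with the variable's current value."""
    env = os.environ if environ is None else environ

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(substitute, value)


dsn = resolve_env_refs(
    "postgresql://svc:${PG_PASSWORD}@db.internal/rec",
    {"PG_PASSWORD": "example-only"},
)
```

Failing on a missing variable, rather than substituting an empty string, surfaces a misconfigured environment before any connection attempt is made.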
