Configuration

Daedalus has two kinds of configuration:

  1. Pipeline configuration — layered YAML files under config/ that define feature dimensions, the months to process, which views to load, the output schema, and the rolling-aggregation specs. These drive every training run.
  2. Agent CLI configuration — a small TOML file (~/.daedalus/config.toml) that holds operator-facing defaults for the daeda CLI and platform surfaces.

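The pipeline configuration is loaded through Pydantic Settings with YAML sources (SettingsConfigDict(yaml_file=...)), so the YAML files are the source of truth for every pipeline run. The agent CLI configuration is read separately from its TOML file.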

The layered pipeline config

Pipeline configuration is split across three files, each with a clear job:

| File | Owns |
| --- | --- |
| config/base.yaml | Feature dimensions, feature lists, the default feature S3 root |
| config/training/runtime.yaml | Date ranges, chunking, feature_refs, output_columns, the compute engine, day parallelism |
| config/training/aggregations.yaml | aggregation_specs, enrichment_specs, embedding_stores |

config/base.yaml — project-level constants

Feature dimensions, the canonical feature lists, and a default feature S3 root:

yaml
max_like_artworks_len: 32       # max user like-interaction length
artwork_embed_dim: 1152         # artwork embedding dimension
max_followed_users_len: 32      # max followed-users length

# Default feature S3 root (a convenience prefix for parquet sources)
feature_s3_root_dir: "s3://pixai-rec-sys/infra"

It also defines the click_event_features, artwork_features, and user_features column lists used across the pipeline.

config/training/runtime.yaml — the run knobs

This is the file you tune per run. The key fields:

yaml
# Convenience root for repo-relative parquet source paths. Data is read in
# place from each feature view's configured source (Postgres / S3 parquet /
# DuckLake / Snowflake) — this is not a download/staging directory. A local
# parquet dir (e.g. feed/, user/, ...) is simply one direct source.
data_root: data/mewtant

# Inclusive YYYYMM month range to process (the pipeline iterates every
# month from start through end)
feed_start: "202602"
feed_end: "202605"

# Rows per output chunk (CLI-compat; the staged SQL skeleton does not
# chunk in Python)
entity_spine_chunk_rows: 500000

# Target days processed concurrently — keep at 1 unless the host has the
# CPU/memory/spill headroom for N independent daily pipelines
day_workers: 1

# Rolling-aggregation engine
compute:
    engine: duckdb                # `duckdb` (default) or `polars`
    duckdb_config:
        threads: "4"
        memory_limit: "32GiB"
        temp_directory: "/data/jiacheng/cache/duckdb"
        preserve_insertion_order: "false"

# Output directory for per-day parquets
output_root: data/training_output
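
To illustrate the inclusive feed_start / feed_end range, the sketch below enumerates the months such a range covers. The function name iter_months is hypothetical and not part of Daedalus; it only shows the month arithmetic, including the rollover across a year boundary.

```python
def iter_months(feed_start, feed_end):
    """Return every YYYYMM month from feed_start through feed_end, inclusive.

    Hypothetical helper for illustration only.
    """
    start_year, start_month = int(feed_start[:4]), int(feed_start[4:6])
    end_year, end_month = int(feed_end[:4]), int(feed_end[4:6])
    # Work on a running month index so year boundaries need no special casing.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    if first > last:
        raise ValueError(f"feed_start {feed_start} is after feed_end {feed_end}")
    return [f"{idx // 12:04d}{idx % 12 + 1:02d}" for idx in range(first, last + 1)]


print(iter_months("202602", "202605"))
# ['202602', '202603', '202604', '202605']
```

Quoting the values in YAML, as the file above does, keeps them as strings rather than integers, which is what this slicing assumes.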

feature_refs — which views to load

Each entry has the form view_name:column_name. An entry pulls in the whole view: every column defined in that view is loaded, not only the column named in the entry. In effect, the list selects which views are loaded:

yaml
feature_refs:
    - user_profile:followed_user_ids
    - artwork_properties:author_id
    - artwork_generation:model_id
    - artwork_aesthetic:aes_score
    - artwork_tack:tack_ids
    - artwork_vector:media_safety_score
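
Because each entry effectively selects a view, the set of views to load can be derived by splitting on the colon. The following is a minimal sketch under that assumption; views_to_load is a hypothetical name, and the real loader may validate entries differently.

```python
def views_to_load(feature_refs):
    """Return the distinct view names referenced, in first-seen order.

    Hypothetical helper: assumes every entry is 'view_name:column_name'.
    """
    views = []
    for ref in feature_refs:
        view_name, sep, column_name = ref.partition(":")
        if not sep or not view_name or not column_name:
            raise ValueError(f"expected 'view_name:column_name', got {ref!r}")
        if view_name not in views:
            views.append(view_name)
    return views


refs = [
    "user_profile:followed_user_ids",
    "artwork_properties:author_id",
    "artwork_generation:model_id",
]
print(views_to_load(refs))
# ['user_profile', 'artwork_properties', 'artwork_generation']
```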

output_columns — the output schema

Columns written to the per-day parquets. Entries given as a {name, default} mapping are filled with that literal when the column is absent — so the schema stays consistent even before the enrichment operators run. Dynamic embedding columns use default_sql to emit a typed NULL placeholder:

yaml
output_columns:
    - name: user_id
      default: 0
    - name: artwork_id
      default: 0
    - event_timestamp          # bare entry: no default
    - event
    # ... rolling aggregations, user/artwork features ...
    - name: image_embedding
      default_sql: "CAST(NULL AS FLOAT[])"
    - name: like_artwork_avg_embeds
      default_sql: "CAST(NULL AS FLOAT[])"
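
One way to picture how these entries become a stable schema is to turn each one into a SELECT-list item. The sketch below is an assumption about the mechanics, not the pipeline's actual code: select_expression and sql_literal are hypothetical names, and raising an error for a missing bare entry is a choice made here for illustration.

```python
def sql_literal(value):
    """Render a Python scalar as a SQL literal (simple cases only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def select_expression(entry, available):
    """Build one SELECT-list item for an output_columns entry.

    entry is either a bare column name or a mapping with 'name' plus
    'default' or 'default_sql'. available is the set of columns that
    already exist upstream.
    """
    if isinstance(entry, str):
        name, spec = entry, {}
    else:
        name, spec = entry["name"], entry
    if name in available:
        return f'"{name}"'
    if "default_sql" in spec:
        # default_sql is passed through verbatim, e.g. a typed NULL.
        return f'{spec["default_sql"]} AS "{name}"'
    if "default" in spec:
        return f'{sql_literal(spec["default"])} AS "{name}"'
    # Assumption: a bare entry with no default must exist upstream.
    raise KeyError(f"column {name!r} is absent and has no default")


entries = [
    {"name": "user_id", "default": 0},
    "event",
    {"name": "image_embedding", "default_sql": "CAST(NULL AS FLOAT[])"},
]
print([select_expression(e, {"event"}) for e in entries])
```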

Engine choice

compute.engine selects the rolling-aggregation engine. The DuckDB and Polars implementations are bit-for-bit equivalent on every production spec, so the choice is about performance and A/B testing, not correctness. compute.duckdb_config is forwarded to duckdb.connect(config=...). Set threads and memory_limit explicitly when the host's /proc reports more CPU or memory than the container's cgroup actually grants (common on shared devpods), because the engine may otherwise size itself from the inflated figures.
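
When picking explicit values, the cgroup v2 files /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max are one place to read the real limits. The sketch below parses their contents into a duckdb_config-style mapping. It is not part of Daedalus: the function names are hypothetical, the 80% memory headroom is an arbitrary assumption, and it assumes a cgroup v2 host.

```python
import math


def threads_from_cpu_max(cpu_max_text, fallback):
    """Parse cgroup v2 cpu.max ('<quota> <period>' or 'max <period>')."""
    quota, _, period = cpu_max_text.strip().partition(" ")
    if quota == "max" or not period:
        return fallback  # no CPU quota set
    return max(1, math.ceil(int(quota) / int(period)))


def duckdb_config_from_limits(cpu_max_text, memory_max_text, fallback_threads):
    """Build string-valued settings in the shape used by compute.duckdb_config."""
    config = {"threads": str(threads_from_cpu_max(cpu_max_text, fallback_threads))}
    memory = memory_max_text.strip()
    if memory != "max":
        # Assumption: leave 20% headroom below the cgroup memory limit.
        budget_mib = int(memory) * 8 // 10 // 2**20
        config["memory_limit"] = f"{budget_mib}MiB"
    return config


# Example with literal file contents: 4 CPUs of quota and a 32 GiB limit.
print(duckdb_config_from_limits("400000 100000", "34359738368", 8))
# {'threads': '4', 'memory_limit': '26214MiB'}
```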

config/training/aggregations.yaml — the aggregation specs

This file defines the rolling-aggregation, enrichment, and embedding-store specs.

  • aggregation_specs: per-feature rolling aggregations. Each spec sets a name, group_by, value_column, operation (list / count / flatten_distinct), timestamp_column, window_days, max_items, and the include_current and exclude_self flags, plus an optional condition:

    yaml
    aggregation_specs:
        - name: like_artwork_ids_7d
          group_by: ["user_id"]
          value_column: artwork_id
          operation: list
          timestamp_column: event_timestamp
          include_current: false
          window_days: 7
          max_items: 32
          condition: "hist.event = 'like_click'"
          exclude_self: true
  • enrichment_specs: dynamic features computed by the enrichment (Python/Ray) operators. Each spec names a source_id_column, the embedding store to look it up in (store_name), and the UDF to apply (udf_name):

    yaml
    enrichment_specs:
        - name: like_artwork_avg_embeds
          source_id_column: like_artwork_ids_7d
          store_name: artwork_embeddings
          udf_name: avg_pool_embeddings_udf
        - name: image_embedding
          source_id_column: artwork_id
          store_name: artwork_embeddings
          udf_name: lookup_embedding_udf
  • embedding_stores: definitions of the Lance stores that enrichment_specs reference by store_name:

    yaml
    embedding_stores:
        - name: artwork_embeddings
          type: lance
          path: data/store/v6/artwork_embedding
          id_column: id
          embedding_column: image_embedding
          embedding_dim: 1152
          embedding_dtype: float16
          build_batch_size: 50000
          source_parquet: data/mewtant/siglip2_vectors
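
To make the aggregation_specs fields concrete, here is a small pure-Python sketch of a `list` operation in the spirit of like_artwork_ids_7d. It is not the DuckDB or Polars implementation. It assumes the window covers [t - window_days, t), so the current event is never included, and that max_items keeps the most recent values first; the exact semantics of include_current and exclude_self are not modelled. The rolling_list name and the keep callback (standing in for condition) are hypothetical.

```python
from datetime import datetime, timedelta


def rolling_list(events, group_key, value_key, ts_key, window_days, max_items,
                 keep=lambda hist: True):
    """For each event, list value_key over the same group's earlier events."""
    window = timedelta(days=window_days)
    results = []
    for current in events:
        history = [
            hist for hist in events
            if hist[group_key] == current[group_key]
            and current[ts_key] - window <= hist[ts_key] < current[ts_key]
            and keep(hist)
        ]
        # Assumption: most recent first, then truncate to max_items.
        history.sort(key=lambda hist: hist[ts_key], reverse=True)
        results.append([hist[value_key] for hist in history[:max_items]])
    return results


def ev(user_id, artwork_id, event, day):
    return {"user_id": user_id, "artwork_id": artwork_id, "event": event,
            "event_timestamp": datetime(2026, 2, day)}


events = [
    ev(1, 10, "like_click", 1),
    ev(1, 11, "view", 3),
    ev(1, 12, "like_click", 5),
    ev(1, 13, "like_click", 10),
    ev(2, 10, "like_click", 10),
]
likes = rolling_list(events, "user_id", "artwork_id", "event_timestamp",
                     window_days=7, max_items=32,
                     keep=lambda hist: hist["event"] == "like_click")
print(likes)
# [[], [10], [10], [12], []]
```

The quadratic scan is only for readability; a real engine would express the same window as a range join or a window function.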

Overriding config paths

daeda pipeline train looks for these files at their default locations, but accepts overrides on the command line:

bash
daeda pipeline train dssm_ranking \
    --runtime-config-path config/training/runtime.yaml \
    --aggregation-config-path config/training/aggregations.yaml \
    --output-root /path/to/output \
    --day-workers 4

Agent CLI config

The daeda config command group persists operator-facing CLI defaults to a TOML file. By default this lives at ~/.daedalus/config.toml; override the location with the $DAEDALUS_CONFIG environment variable.

bash
daeda config init          # write the default config (use --force to overwrite)
daeda config show          # print the effective config (--format json for JSON)
daeda config set <key> <value>
daeda config path          # print the resolved config file path
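
The path resolution described above (environment override first, then the home-directory default) can be sketched as follows. This is an illustration, not the CLI's source: resolve_config_path is a hypothetical name, and treating an empty $DAEDALUS_CONFIG as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_config_path(environ=None):
    """Return the TOML config path: $DAEDALUS_CONFIG if set, else the default."""
    environ = os.environ if environ is None else environ
    override = environ.get("DAEDALUS_CONFIG")
    if override:
        # Assumption: allow '~' in the override for convenience.
        return Path(override).expanduser()
    return Path.home() / ".daedalus" / "config.toml"


print(resolve_config_path({"DAEDALUS_CONFIG": "/tmp/alt-config.toml"}))
print(resolve_config_path({}))
```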

The supported keys (with their defaults) are:

| Key | Default |
| --- | --- |
| dagster_graphql_url | http://localhost:3000/graphql |
| api_url | http://localhost:8000 |
| feature_views_dir | feature_views |
| feature_services_dir | feature_services |
| default_service | dssm_ranking |
| output_root | data/training_output |

For example, to point the CLI at a different feature-service directory:

bash
daeda config set feature_services_dir /path/to/feature_services

Secrets

Never inline credentials in config or source YAML. Data sources are a config-only choice of backend (parquet, DuckLake, Postgres, Snowflake), and all credentials are supplied via ${ENV} references resolved from the environment — not written into the YAML. See Data Sources Overview for the per-backend recipes.

WARNING

Credentials referenced as ${ENV} are read from the process environment at run time. Keep them out of the repo and out of ~/.daedalus/config.toml (which holds only non-secret operator defaults).
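
As a rough picture of what resolving a ${ENV} reference involves, the sketch below substitutes ${NAME} placeholders in a string from a supplied environment mapping and fails loudly when a variable is missing. The helper name expand_env_refs, the example connection string, and the variable PG_PASSWORD are all hypothetical; the actual resolver in Daedalus may behave differently.

```python
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_refs(value, environ):
    """Replace each ${NAME} in value with environ[NAME]; raise if unset."""
    def replace(match):
        name = match.group(1)
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]

    return _ENV_REF.sub(replace, value)


# Pass os.environ in real use; a plain dict keeps the example self-contained.
print(expand_env_refs("postgresql://app:${PG_PASSWORD}@db/rec",
                      {"PG_PASSWORD": "example-only"}))
# postgresql://app:example-only@db/rec
```

Failing on an unset variable, rather than substituting an empty string, surfaces a missing credential at load time instead of as an authentication error later.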