Configuration
Daedalus has two kinds of configuration:
- Pipeline configuration: layered YAML files under config/ that define feature dimensions, the months to process, which views to load, the output schema, and the rolling-aggregation specs. These drive every training run.
- Agent CLI configuration: a small TOML file (~/.daedalus/config.toml) that holds operator-facing defaults for the daeda CLI and platform surfaces.
All of it is loaded through Pydantic Settings with YAML sources (SettingsConfigDict(yaml_file=...)), so the YAML files are the source of truth.
The layered pipeline config
Pipeline configuration is split across three files, each with a clear job:
| File | Owns |
|---|---|
| config/base.yaml | Feature dimensions, feature lists, the default feature S3 root |
| config/training/runtime.yaml | Date ranges, chunking, feature_refs, output_columns, the compute engine, day parallelism |
| config/training/aggregations.yaml | aggregation_specs, enrichment_specs, embedding_stores |
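One way to picture the split is that each file contributes its own set of top-level keys and the effective pipeline config is their union. The sketch below illustrates that idea only; it is not the project's loader. `merge_layers` and the trimmed dictionaries are hypothetical stand-ins for the parsed YAML files, and treating a key defined in two files as an error is an assumption.

```python
from typing import Any, Mapping


def merge_layers(layers: Mapping[str, Mapping[str, Any]]) -> dict[str, Any]:
    """Union the top-level keys of several config layers.

    `layers` maps a file name to its already-parsed contents. A key that
    appears in more than one file is treated as an ownership conflict
    (an assumption of this sketch, not documented loader behaviour).
    """
    merged: dict[str, Any] = {}
    owner: dict[str, str] = {}
    for filename, content in layers.items():
        for key, value in content.items():
            if key in owner:
                raise ValueError(
                    f"{key!r} is defined in both {owner[key]} and {filename}"
                )
            owner[key] = filename
            merged[key] = value
    return merged


# Hypothetical, heavily trimmed stand-ins for the three parsed files.
layers = {
    "config/base.yaml": {"artwork_embed_dim": 1152},
    "config/training/runtime.yaml": {"feed_start": "202602", "day_workers": 1},
    "config/training/aggregations.yaml": {"aggregation_specs": []},
}
config = merge_layers(layers)
```

Failing on a duplicate key keeps the "one owner per setting" rule from the table checkable instead of relying on convention.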
config/base.yaml — project-level constants
Feature dimensions, the canonical feature lists, and a default feature S3 root:
max_like_artworks_len: 32 # max user like-interaction length
artwork_embed_dim: 1152 # artwork embedding dimension
max_followed_users_len: 32 # max followed-users length
# Default feature S3 root (a convenience prefix for parquet sources)
feature_s3_root_dir: "s3://pixai-rec-sys/infra"
It also defines the click_event_features, artwork_features, and user_features column lists used across the pipeline.
config/training/runtime.yaml — the run knobs
This is the file you tune per run. The key fields:
# Convenience root for repo-relative parquet source paths. Data is read in
# place from each feature view's configured source (Postgres / S3 parquet /
# DuckLake / Snowflake) — this is not a download/staging directory. A local
# parquet dir (e.g. feed/, user/, ...) is simply one direct source.
data_root: data/mewtant
# Inclusive YYYYMM month range to process (the pipeline iterates every
# month from start through end)
feed_start: "202602"
feed_end: "202605"
# Rows per output chunk (CLI-compat; the staged SQL skeleton does not
# chunk in Python)
entity_spine_chunk_rows: 500000
# Target days processed concurrently — keep at 1 unless the host has the
# CPU/memory/spill headroom for N independent daily pipelines
day_workers: 1
# Rolling-aggregation engine
compute:
  engine: duckdb # `duckdb` (default) or `polars`
  duckdb_config:
    threads: "4"
    memory_limit: "32GiB"
    temp_directory: "/data/jiacheng/cache/duckdb"
    preserve_insertion_order: "false"
# Output directory for per-day parquets
output_root: data/training_output
feature_refs — which views to load
Each entry is view_name:column_name. All columns defined in the view are loaded — the column name only selects which views are pulled in:
feature_refs:
- user_profile:followed_user_ids
- artwork_properties:author_id
- artwork_generation:model_id
- artwork_aesthetic:aes_score
- artwork_tack:tack_ids
- artwork_vector:media_safety_score
output_columns — the output schema
Columns written to the per-day parquets. Entries given as a {name, default} mapping are filled with that literal when the column is absent — so the schema stays consistent even before the enrichment operators run. Dynamic embedding columns use default_sql to emit a typed NULL placeholder:
output_columns:
- name: user_id
  default: 0
- name: artwork_id
  default: 0
- event_timestamp # bare entry: no default
- event
# ... rolling aggregations, user/artwork features ...
- name: image_embedding
  default_sql: "CAST(NULL AS FLOAT[])"
- name: like_artwork_avg_embeds
  default_sql: "CAST(NULL AS FLOAT[])"
Engine choice
compute.engine selects the rolling-aggregation engine. The DuckDB and Polars implementations are bit-for-bit equivalent on every production spec, so the choice is about performance and A/B testing, not correctness. compute.duckdb_config is forwarded to duckdb.connect(config=...); pass explicit threads / memory_limit when /proc reports more CPU or memory than the process's cgroup actually allows (common on shared devpods).
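On hosts that use cgroup v2, the limits that apply to the container are usually readable from /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max. The sketch below shows one way to derive explicit threads and memory_limit values from those files. The helper names, the 80% memory headroom, and the fallback values are assumptions made for illustration; nothing here is part of Daedalus.

```python
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy


def parse_cpu_max(text: str) -> Optional[int]:
    """Parse cpu.max ("<quota> <period>" or "max <period>") into whole CPUs."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit set
    return max(1, int(quota) // int(period))


def parse_memory_max(text: str) -> Optional[int]:
    """Parse memory.max (a byte count or "max") into bytes."""
    value = text.strip()
    return None if value == "max" else int(value)


def duckdb_config_from_cgroup(
    cpu_text: str, mem_text: str, headroom_percent: int = 80
) -> dict[str, str]:
    """Build a duckdb_config-style mapping; unlimited values are left out."""
    config: dict[str, str] = {}
    cpus = parse_cpu_max(cpu_text)
    if cpus is not None:
        config["threads"] = str(cpus)
    mem = parse_memory_max(mem_text)
    if mem is not None:
        # Integer arithmetic keeps the result exact; never go below 1 GiB.
        gib = max(1, mem * headroom_percent // 100 // 2**30)
        config["memory_limit"] = f"{gib}GiB"
    return config


def read_or_default(path: Path, default: str) -> str:
    """Read a cgroup file, falling back when it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return default


if __name__ == "__main__":
    cpu_text = read_or_default(CGROUP_ROOT / "cpu.max", "max 100000")
    mem_text = read_or_default(CGROUP_ROOT / "memory.max", "max")
    print(duckdb_config_from_cgroup(cpu_text, mem_text))
```

The printed mapping can be copied into compute.duckdb_config by hand. Values are emitted as strings to match the quoting used in runtime.yaml.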
config/training/aggregations.yaml — the aggregation specs
This file defines the rolling-aggregation, enrichment, and embedding-store specs.
- aggregation_specs: per-feature rolling aggregations. Each spec sets a group_by, value_column, operation (list / count / flatten_distinct), timestamp_column, window_days, an optional condition, max_items, and exclude_self. The example also sets include_current:
aggregation_specs:
- name: like_artwork_ids_7d
  group_by: ["user_id"]
  value_column: artwork_id
  operation: list
  timestamp_column: event_timestamp
  include_current: false
  window_days: 7
  max_items: 32
  condition: "hist.event = 'like_click'"
  exclude_self: true
- enrichment_specs: dynamic features computed by the enrichment (pythonic/Ray) operators. Each points a source_id_column at an embedding_store:
enrichment_specs:
- name: like_artwork_avg_embeds
  source_id_column: like_artwork_ids_7d
  store_name: artwork_embeddings
  udf_name: avg_pool_embeddings_udf
- name: image_embedding
  source_id_column: artwork_id
  store_name: artwork_embeddings
  udf_name: lookup_embedding_udf
- embedding_stores: Lance store definitions:
embedding_stores:
- name: artwork_embeddings
  type: lance
  path: data/store/v6/artwork_embedding
  id_column: id
  embedding_column: image_embedding
  embedding_dim: 1152
  embedding_dtype: float16
  build_batch_size: 50000
  source_parquet: data/mewtant/siglip2_vectors
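To make the shape of an aggregation spec concrete, here is a minimal sketch of how one entry could be mirrored and sanity-checked in Python. `AggregationSpec` is a hypothetical class, not the project's model; the defaults for the optional fields and the validation rules are assumptions. Only the field names and the three operations come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Operations named in the text above.
ALLOWED_OPERATIONS = {"list", "count", "flatten_distinct"}


@dataclass(frozen=True)
class AggregationSpec:
    """Hypothetical in-memory mirror of one aggregation_specs entry."""

    name: str
    group_by: list[str]
    value_column: str
    operation: str
    timestamp_column: str
    window_days: int
    # The defaults below are assumptions, not documented behaviour.
    include_current: bool = False
    max_items: Optional[int] = None
    condition: Optional[str] = None
    exclude_self: bool = False

    def __post_init__(self) -> None:
        if self.operation not in ALLOWED_OPERATIONS:
            raise ValueError(f"{self.name}: unknown operation {self.operation!r}")
        if not self.group_by:
            raise ValueError(f"{self.name}: group_by must not be empty")
        if self.window_days <= 0:
            raise ValueError(f"{self.name}: window_days must be positive")
        if self.max_items is not None and self.max_items <= 0:
            raise ValueError(f"{self.name}: max_items must be positive")


# The like_artwork_ids_7d entry from the YAML above, as a plain dict.
raw = {
    "name": "like_artwork_ids_7d",
    "group_by": ["user_id"],
    "value_column": "artwork_id",
    "operation": "list",
    "timestamp_column": "event_timestamp",
    "include_current": False,
    "window_days": 7,
    "max_items": 32,
    "condition": "hist.event = 'like_click'",
    "exclude_self": True,
}
spec = AggregationSpec(**raw)
```

Validating at construction time means a misspelled operation or a zero-day window surfaces when the config is read, before any data is scanned.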
Overriding config paths
daeda pipeline train resolves these files by default but accepts overrides:
daeda pipeline train dssm_ranking \
  --runtime-config-path config/training/runtime.yaml \
  --aggregation-config-path config/training/aggregations.yaml \
  --output-root /path/to/output \
  --day-workers 4
Agent CLI config
The daeda config command group persists operator-facing CLI defaults to a TOML file. By default this lives at ~/.daedalus/config.toml; override the location with the $DAEDALUS_CONFIG environment variable.
daeda config init # write the default config (use --force to overwrite)
daeda config show # print the effective config (--format json for JSON)
daeda config set <key> <value>
daeda config path # print the resolved config file path
The supported keys (with their defaults) are:
| Key | Default |
|---|---|
| dagster_graphql_url | http://localhost:3000/graphql |
| api_url | http://localhost:8000 |
| feature_views_dir | feature_views |
| feature_services_dir | feature_services |
| default_service | dssm_ranking |
| output_root | data/training_output |
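The path lookup and the defaults table can be sketched in a few lines of Python. This illustrates the documented behaviour and is not the CLI's actual code: `resolve_config_path` and `effective_config` are hypothetical names, and two details are assumptions, namely that an empty $DAEDALUS_CONFIG counts as unset and that unknown keys are rejected.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULTS = {
    "dagster_graphql_url": "http://localhost:3000/graphql",
    "api_url": "http://localhost:8000",
    "feature_views_dir": "feature_views",
    "feature_services_dir": "feature_services",
    "default_service": "dssm_ranking",
    "output_root": "data/training_output",
}


def resolve_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $DAEDALUS_CONFIG if set, else ~/.daedalus/config.toml."""
    env = os.environ if env is None else env
    override = env.get("DAEDALUS_CONFIG")
    if override:  # assumption: an empty value is treated as unset
        return Path(override).expanduser()
    return Path.home() / ".daedalus" / "config.toml"


def effective_config(file_values: Mapping[str, str]) -> dict[str, str]:
    """Overlay values parsed from the TOML file on top of the defaults.

    Rejecting unknown keys is an assumption of this sketch.
    """
    unknown = set(file_values) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unsupported keys: {sorted(unknown)}")
    return {**DEFAULTS, **file_values}
```

Passing the environment in as a mapping keeps the lookup easy to test without touching the real process environment.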
For example, to point the CLI at a different feature-service directory:
daeda config set feature_services_dir /path/to/feature_services
Secrets
Never inline credentials in config or source YAML. Data sources are a config-only choice of backend (parquet, DuckLake, Postgres, Snowflake), and all credentials are supplied via ${ENV} references resolved from the environment — not written into the YAML. See Data Sources Overview for the per-backend recipes.
WARNING
Credentials referenced as ${ENV} are read from the process environment at run time. Keep them out of the repo and out of ~/.daedalus/config.toml (which holds only non-secret operator defaults).
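As a rough illustration of what resolving a ${ENV} reference involves, the sketch below substitutes ${NAME} placeholders from a supplied environment mapping. It is not Daedalus's resolver: `expand_env_refs`, the example connection string, and the variable names PG_USER and PG_PASSWORD are all made up, and failing on an unset variable is an assumption.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_refs(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${NAME} in `value` with env[NAME].

    Raises KeyError for an unset variable so a missing secret fails loudly
    instead of becoming an empty string (an assumption of this sketch).
    """
    env = os.environ if env is None else env

    def _lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(_lookup, value)


# Hypothetical connection string; PG_USER and PG_PASSWORD are made-up names.
dsn = expand_env_refs(
    "postgresql://${PG_USER}:${PG_PASSWORD}@db.internal:5432/features",
    env={"PG_USER": "reader", "PG_PASSWORD": "s3cret"},
)
```

Because the YAML only ever contains the placeholder, the file can be committed safely while the secret itself stays in the environment.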