Overview
Training in Daedalus is one operator-pipeline DAG — compile → skeleton → enrich — run through the executor with a single command:
daeda pipeline train dssm_ranking
This is the sole training entry point (Core v1). It compiles the feature service feature_services/dssm_ranking.yaml into an engine-tagged operator DAG, then runs the SQL source/skeleton stage (feature joins + rolling aggregations) followed by the Pythonic enrich stage (Ray + Lance avg-pooled embeddings).
Single entry point
The standalone daeda skeleton, daeda enrich, and daeda enrich-shards commands were removed in v0.7.1 (the Core-v1 lean cut). daeda pipeline train wraps the same proven engine entry points, so its output is byte-identical to the old dual path.
The two stages
| Stage | Engine tag | Entry point | What it does |
|---|---|---|---|
| Skeleton (source layer) | sql | run_skeleton_stage | Per-day feature joins + rolling aggregations → per-day parquet |
| Enrich (dynamic features) | pythonic | run_enrich_stage | Appends avg-pooled artwork embeddings via Ray Data + Lance |
Each compiled operator records its declared engine — sql, sql_arrow_udf, or pythonic — so the DAG is fully transparent and hand-tunable before it runs.
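As a rough illustration of what an engine-tagged DAG looks like in memory, the sketch below models operators with a declared engine and groups them by tag. The `Operator` class, its fields, and `group_by_engine` are hypothetical names invented for this example; only the three engine tags and the two stage names come from this page.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The three engine tags named on this page.
ENGINES = {"sql", "sql_arrow_udf", "pythonic"}


@dataclass
class Operator:
    """Hypothetical stand-in for one compiled operator."""
    name: str
    engine: str
    depends_on: list = field(default_factory=list)

    def __post_init__(self):
        # Reject tags outside the documented set.
        if self.engine not in ENGINES:
            raise ValueError(f"unknown engine tag: {self.engine!r}")


def group_by_engine(operators):
    """Map each engine tag to the operator names that declare it."""
    groups = defaultdict(list)
    for op in operators:
        groups[op.engine].append(op.name)
    return dict(groups)


dag = [
    Operator("skeleton", "sql"),
    Operator("enrich", "pythonic", depends_on=["skeleton"]),
]
print(group_by_engine(dag))  # {'sql': ['skeleton'], 'pythonic': ['enrich']}
```

Validating the tag at construction time means a hand-edited DAG with a misspelled engine would fail before any stage runs, which is one plausible reason to record the engine per operator.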
Compile: emit an editable DAG
Inspect or hand-tune the compiled DAG before running it. compile writes <output-dir>/<service>.generated.yaml (default output dir config/pipelines/) and echoes the YAML to stdout:
daeda pipeline compile dssm_ranking # write + echo the generated DAG
daeda pipeline compile dssm_ranking --no-write # echo only, write nothing
To override an operator's engine, hand-edit <output-dir>/<service>.yaml; <service>.generated.yaml is always the regenerated artifact.
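The relationship between the two files can be sketched as a simple lookup. This assumes the hand-edited `<service>.yaml` takes precedence whenever it exists, which this page implies but does not state outright; `resolve_pipeline_yaml` is a hypothetical helper, not part of the daeda CLI.

```python
import tempfile
from pathlib import Path


def resolve_pipeline_yaml(output_dir, service):
    """Pick the DAG file to run (assumed precedence: hand-edited first)."""
    edited = Path(output_dir) / f"{service}.yaml"
    generated = Path(output_dir) / f"{service}.generated.yaml"
    # Assumption: a hand-edited file, when present, overrides the generated one.
    return edited if edited.exists() else generated


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dssm_ranking.generated.yaml").write_text("operators: []\n")
    print(resolve_pipeline_yaml(tmp, "dssm_ranking").name)  # dssm_ranking.generated.yaml

    (Path(tmp) / "dssm_ranking.yaml").write_text("operators: []\n")
    print(resolve_pipeline_yaml(tmp, "dssm_ranking").name)  # dssm_ranking.yaml
```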
Train: run the pipeline
# Both stages, full configured month range
daeda pipeline train dssm_ranking
# Single target day, both stages
daeda pipeline train dssm_ranking --target-date 2026-05-15
# One stage at a time (resumable against an existing skeleton output)
daeda pipeline train dssm_ranking --skeleton-only --target-date 2026-05-15
daeda pipeline train dssm_ranking --enrich-only --target-date 2026-05-15
# Concurrent target days (needs CPU / memory / DuckDB-spill headroom)
daeda pipeline train dssm_ranking --day-workers 4
# Override config + output locations, size the Ray envelope
daeda pipeline train dssm_ranking \
--runtime-config-path config/training/runtime.yaml \
--aggregation-config-path config/training/aggregations.yaml \
--output-root /path/to/output \
--ray-num-cpus 16
daeda pipeline train flags
| Flag | Default | Description |
|---|---|---|
| SERVICE | (required) | Feature service name (e.g. dssm_ranking) |
| --feature-views-dir | feature_views | Directory of feature view YAML |
| --feature-services-dir | feature_services | Directory of feature service YAML |
| --runtime-config-path | config/training/runtime.yaml | Training runtime config |
| --aggregation-config-path | config/training/aggregations.yaml | Aggregation + embedding-store config |
| --output-root | (from config) | Override the skeleton output root |
| --target-date | (all days) | Narrow a run to a single YYYY-MM-DD day |
| --day-workers | (from config, 1) | Concurrent target days |
| --skeleton-only | off | Run the source stage only |
| --enrich-only | off | Run the enrichment stage only |
| --ray-num-cpus | 16 | Ray logical CPUs for the enrich stage |
The processed month range (feed_start / feed_end) and chunk sizes are config-driven in config/training/runtime.yaml; --target-date narrows a run to a single day.
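To make the interaction between the configured range, --target-date, and --day-workers concrete, here is a minimal sketch. It assumes feed_start and feed_end are inclusive ISO dates (the real format in runtime.yaml may differ) and uses a thread pool purely for illustration; `target_days`, `run_days`, and `process_day` are hypothetical names, and the actual executor may schedule days differently.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def target_days(feed_start, feed_end, target_date=None):
    """Return the ISO days a run would cover (assumed inclusive range)."""
    if target_date is not None:
        # --target-date narrows the run to exactly one day.
        return [date.fromisoformat(target_date).isoformat()]
    start, end = date.fromisoformat(feed_start), date.fromisoformat(feed_end)
    if end < start:
        raise ValueError("feed_end precedes feed_start")
    span = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(span + 1)]


def run_days(days, process_day, day_workers=1):
    """Apply process_day to each day with at most day_workers in flight."""
    with ThreadPoolExecutor(max_workers=day_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(process_day, days))


days = target_days("2026-05-01", "2026-05-03")
print(days)  # ['2026-05-01', '2026-05-02', '2026-05-03']
print(target_days("2026-05-01", "2026-05-31", target_date="2026-05-15"))  # ['2026-05-15']
print(run_days(days, lambda d: f"dt={d}", day_workers=2))
```

Each in-flight day holds its own working set, which is why raising the worker count needs the CPU, memory, and spill headroom mentioned above.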
Output layout
Both stages write under output_root (default data/training_output/, or data/mewtant/training_output/ per the documented production default):
| Path | Written by | Description |
|---|---|---|
| dt=YYYY-MM-DD/part-N-0.parquet | skeleton | Per-day Zstd parquet partitions, sorted by event_timestamp |
| qid_map.parquet | skeleton | Session → incremental integer qid map (grows across runs; only when feed carries feed_session) |
| <root>_enriched/dt=YYYY-MM-DD/*.parquet | enrich | Sharded, chronologically contiguous parquet with the embedding columns appended |
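The "grows across runs" behaviour of the qid map can be pictured with a small in-memory sketch. It assumes qids start at 0 and that new sessions receive the next unused integer in first-seen order; the real numbering base and the parquet persistence are not specified here, and `extend_qid_map` is a hypothetical helper.

```python
def extend_qid_map(qid_map, sessions):
    """Return a copy of qid_map with unseen sessions assigned new qids."""
    result = dict(qid_map)
    # Continue numbering after the largest qid already issued (assumed base: 0).
    next_qid = max(result.values(), default=-1) + 1
    for session in sessions:
        if session not in result:
            result[session] = next_qid
            next_qid += 1
    return result


run1 = extend_qid_map({}, ["s-a", "s-b", "s-a"])
print(run1)  # {'s-a': 0, 's-b': 1}

run2 = extend_qid_map(run1, ["s-b", "s-c"])
print(run2)  # {'s-a': 0, 's-b': 1, 's-c': 2}
```

Existing sessions keep their qid across runs, so rows written by an earlier run stay consistent with later ones.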
The enrich stage skips days already present in its output directory, so a failed run is safe to re-run.
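A minimal sketch of that skip check follows. It assumes a day counts as present when its dt=YYYY-MM-DD directory under the enriched root holds at least one .parquet file; the actual completeness test used by the enrich stage is not documented on this page, and `pending_days` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def pending_days(enriched_root, days):
    """Return the days that still need enrichment, in input order."""
    root = Path(enriched_root)
    done = set()
    if root.is_dir():
        for day_dir in root.glob("dt=*"):
            # Assumption: any parquet file marks the day as already enriched.
            if day_dir.is_dir() and any(day_dir.glob("*.parquet")):
                done.add(day_dir.name[len("dt="):])
    return [d for d in days if d not in done]


with tempfile.TemporaryDirectory() as tmp:
    finished = Path(tmp) / "dt=2026-05-14"
    finished.mkdir()
    (finished / "part-0.parquet").write_bytes(b"")
    print(pending_days(tmp, ["2026-05-14", "2026-05-15"]))  # ['2026-05-15']
```

Under this assumption a day that crashed after writing only some of its shards would also be skipped, so a stricter check (for example a completion marker) may be what the real stage relies on.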
Next
- Skeleton Stage — the SQL source layer: spine build, feature joins, rolling aggregations, per-day output.
- Enrich Stage — the Pythonic stage: Ray Data streaming actors over a Lance embedding store.