Overview

Training in Daedalus is one operator-pipeline DAG — compile → skeleton → enrich — run through the executor with a single command:

```bash
daeda pipeline train dssm_ranking
```

This is the sole training entry point (Core v1). It compiles the feature service feature_services/dssm_ranking.yaml into an engine-tagged operator DAG, then runs the SQL source/skeleton stage (feature joins + rolling aggregations) followed by the Pythonic enrich stage (Ray + Lance avg-pooled embeddings).

Single entry point

The standalone daeda skeleton, daeda enrich, and daeda enrich-shards commands were removed in v0.7.1 (the Core-v1 lean cut). daeda pipeline train wraps the same proven engine entry points, so its output is byte-identical to what the removed standalone commands produced.

The two stages

| Stage | Engine tag | Entry point | What it does |
| --- | --- | --- | --- |
| Skeleton (source layer) | `sql` | `run_skeleton_stage` | Per-day feature joins + rolling aggregations → per-day parquet |
| Enrich (dynamic features) | `pythonic` | `run_enrich_stage` | Appends avg-pooled artwork embeddings via Ray Data + Lance |

Each compiled operator records its declared engine — sql, sql_arrow_udf, or pythonic — so the DAG is fully transparent and hand-tunable before it runs.

Compile: emit an editable DAG

Inspect or hand-tune the compiled DAG before running it. compile writes <output-dir>/<service>.generated.yaml (default output dir config/pipelines/) and echoes the YAML to stdout:

```bash
daeda pipeline compile dssm_ranking            # write + echo the generated DAG
daeda pipeline compile dssm_ranking --no-write # echo only, write nothing
```

To override an operator's engine, hand-edit <output-dir>/<service>.yaml rather than <service>.generated.yaml, which is always the regenerated artifact and is rewritten whenever compile writes its output.
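One way to review hand-edits is to diff the override file against the regenerated artifact. The sketch below is illustrative only: `dag_drift` is a hypothetical helper, not a daeda command, and it assumes the two file names described above plus the default config/pipelines/ output directory.

```bash
# Hypothetical helper (not part of daeda): show how the hand-edited DAG
# differs from the regenerated one. Exit status follows diff:
# 0 = identical, 1 = differences found, 2 = a file is missing.
dag_drift() {
  local service="$1"
  local dir="${2:-config/pipelines}"   # default compile output dir
  diff -u "$dir/$service.generated.yaml" "$dir/$service.yaml"
}

# Usage (after `daeda pipeline compile dssm_ranking`):
#   dag_drift dssm_ranking
```

Lines prefixed with `-` come from the generated DAG and lines prefixed with `+` from the hand-edited file, so an engine override should show up as a small, reviewable hunk.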

Train: run the pipeline

```bash
# Both stages, full configured month range
daeda pipeline train dssm_ranking

# Single target day, both stages
daeda pipeline train dssm_ranking --target-date 2026-05-15

# One stage at a time (resumable against an existing skeleton output)
daeda pipeline train dssm_ranking --skeleton-only --target-date 2026-05-15
daeda pipeline train dssm_ranking --enrich-only   --target-date 2026-05-15

# Concurrent target days (needs CPU / memory / DuckDB-spill headroom)
daeda pipeline train dssm_ranking --day-workers 4

# Override config + output locations, size the Ray envelope
daeda pipeline train dssm_ranking \
    --runtime-config-path config/training/runtime.yaml \
    --aggregation-config-path config/training/aggregations.yaml \
    --output-root /path/to/output \
    --ray-num-cpus 16
```
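The stage flags can be combined with --target-date to process a handful of specific days one stage at a time. The following is a minimal sketch, not an official workflow: `train_days_staged` and the `DAEDA_CMD` variable are hypothetical names introduced here, and only the flags shown above are used. By default it is a dry run that prints each command instead of executing it.

```bash
# Dry run by default: DAEDA_CMD only prints the commands.
# Set DAEDA_CMD=daeda in the environment to actually execute them.
DAEDA_CMD="${DAEDA_CMD:-echo daeda}"

# Hypothetical helper (not part of daeda): for each listed day, run the
# skeleton stage and then the enrich stage, stopping at the first failure.
train_days_staged() {
  local service="$1"
  shift
  local day
  for day in "$@"; do
    $DAEDA_CMD pipeline train "$service" --skeleton-only --target-date "$day" || return 1
    $DAEDA_CMD pipeline train "$service" --enrich-only --target-date "$day" || return 1
  done
}

# Prints the four commands that would run for these two days.
train_days_staged dssm_ranking 2026-05-15 2026-05-16
```

Stopping at the first failure keeps the enrich stage from running against a day whose skeleton output is missing or incomplete.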

`daeda pipeline train` flags

| Flag | Default | Description |
| --- | --- | --- |
| `SERVICE` | (required) | Feature service name (e.g. `dssm_ranking`) |
| `--feature-views-dir` | `feature_views` | Directory of feature view YAML |
| `--feature-services-dir` | `feature_services` | Directory of feature service YAML |
| `--runtime-config-path` | `config/training/runtime.yaml` | Training runtime config |
| `--aggregation-config-path` | `config/training/aggregations.yaml` | Aggregation + embedding-store config |
| `--output-root` | (from config) | Override the skeleton output root |
| `--target-date` | (all days) | Narrow a run to a single `YYYY-MM-DD` day |
| `--day-workers` | (from config, `1`) | Concurrent target days |
| `--skeleton-only` | off | Run the source stage only |
| `--enrich-only` | off | Run the enrichment stage only |
| `--ray-num-cpus` | `16` | Ray logical CPUs for the enrich stage |

The processed month range (feed_start / feed_end) and chunk sizes are config-driven in config/training/runtime.yaml; --target-date narrows a run to a single day.

Output layout

Both stages write under output_root (default data/training_output/; the documented production default is data/mewtant/training_output/):

| Path | Written by | Description |
| --- | --- | --- |
| `dt=YYYY-MM-DD/part-N-0.parquet` | skeleton | Per-day Zstd parquet partitions, sorted by `event_timestamp` |
| `qid_map.parquet` | skeleton | Session → incremental integer `qid` map (grows across runs; only when feed carries `feed_session`) |
| `<root>_enriched/dt=YYYY-MM-DD/*.parquet` | enrich | Sharded, chronologically contiguous parquet with the embedding columns appended |

The enrich stage skips days already present in its output directory, so a failed run is safe to re-run.
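To see which days a re-run would still need to enrich, the partition directories can be compared by hand. This is a rough sketch under stated assumptions: `pending_enrich_days` is a hypothetical helper, it reads `<root>_enriched` as a sibling directory named after the output root (pass a second argument if your layout differs), and it treats the mere existence of a `dt=` directory as "done", which may not match how the enrich stage itself decides what to skip.

```bash
# Hypothetical helper (not part of daeda): print every day that has a
# skeleton partition but no matching partition in the enriched output.
#   $1 = skeleton output root
#   $2 = enriched output dir (assumed default: "<root>_enriched")
pending_enrich_days() {
  local root="${1%/}"
  local enriched="${2:-${root}_enriched}"
  local dir day
  for dir in "$root"/dt=*/; do
    [ -d "$dir" ] || continue          # glob matched nothing
    day="$(basename "$dir")"           # e.g. dt=2026-05-15
    [ -d "$enriched/$day" ] || echo "${day#dt=}"
  done
}

# Usage: feed the pending days back into the enrich stage.
#   for day in $(pending_enrich_days data/training_output); do
#     daeda pipeline train dssm_ranking --enrich-only --target-date "$day"
#   done
```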

Next

Skeleton Stage: the SQL source layer (spine build, feature joins, rolling aggregations, per-day output).
Enrich Stage: the Pythonic stage (Ray Data streaming actors over a Lance embedding store).
