Overview

Training in Daedalus is one operator-pipeline DAG (compile → skeleton → enrich) run through the executor with a single command:

bash
daeda pipeline train dssm_ranking

This is the sole training entry point (Core v1). It compiles the feature service feature_services/dssm_ranking.yaml into an engine-tagged operator DAG, then runs the SQL source/skeleton stage (feature joins + rolling aggregations) followed by the Pythonic enrich stage (Ray + Lance avg-pooled embeddings).

Single entry point

The standalone daeda skeleton, daeda enrich, and daeda enrich-shards commands were removed at v0.7.1 (the Core-v1 lean cut). daeda pipeline train wraps the same proven engine entry points, so its output is byte-identical to the old dual path.

The two stages

| Stage | Engine tag | Entry point | What it does |
| --- | --- | --- | --- |
| Skeleton (source layer) | sql | run_skeleton_stage | Per-day feature joins + rolling aggregations → per-day parquet |
| Enrich (dynamic features) | pythonic | run_enrich_stage | Appends avg-pooled artwork embeddings via Ray Data + Lance |

Each compiled operator records its declared engine — sql, sql_arrow_udf, or pythonic — so the DAG is fully transparent and hand-tunable before it runs.

Compile: emit an editable DAG

Inspect or hand-tune the compiled DAG before running it. compile writes <output-dir>/<service>.generated.yaml (default output dir config/pipelines/) and echoes the YAML to stdout:

bash
daeda pipeline compile dssm_ranking            # write + echo the generated DAG
daeda pipeline compile dssm_ranking --no-write # echo only, write nothing

To override an operator's engine, hand-edit <output-dir>/<service>.yaml; <service>.generated.yaml is always the regenerated artifact.
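
One way to start a hand-edited DAG is to seed it from the generated artifact. The sketch below is illustrative only: seed_editable_dag is a hypothetical helper, not part of daeda, and it assumes that the editable <service>.yaml may begin as a plain copy of <service>.generated.yaml in the default output dir config/pipelines/.

```bash
# Hypothetical helper (not part of daeda): seed the editable DAG file from
# the generated one. Assumes the editable file can start as a plain copy.
seed_editable_dag() {
    local service="$1"
    local out_dir="${2:-config/pipelines}"   # documented default output dir
    local generated="$out_dir/$service.generated.yaml"
    local editable="$out_dir/$service.yaml"

    if [ ! -f "$generated" ]; then
        echo "missing $generated; run: daeda pipeline compile $service" >&2
        return 1
    fi
    # Never overwrite existing hand edits.
    if [ -e "$editable" ]; then
        echo "keeping existing $editable" >&2
        return 0
    fi
    cp "$generated" "$editable"
    echo "$editable"   # print the path of the file to edit
}

# Usage (after `daeda pipeline compile dssm_ranking`):
#   seed_editable_dag dssm_ranking
```

Refusing to overwrite an existing <service>.yaml keeps earlier engine overrides intact when the helper is re-run after a fresh compile.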

Train: run the pipeline

bash
# Both stages, full configured month range
daeda pipeline train dssm_ranking

# Single target day, both stages
daeda pipeline train dssm_ranking --target-date 2026-05-15

# One stage at a time (resumable against an existing skeleton output)
daeda pipeline train dssm_ranking --skeleton-only --target-date 2026-05-15
daeda pipeline train dssm_ranking --enrich-only   --target-date 2026-05-15

# Concurrent target days (needs CPU / memory / DuckDB-spill headroom)
daeda pipeline train dssm_ranking --day-workers 4

# Override config + output locations, size the Ray envelope
daeda pipeline train dssm_ranking \
    --runtime-config-path config/training/runtime.yaml \
    --aggregation-config-path config/training/aggregations.yaml \
    --output-root /path/to/output \
    --ray-num-cpus 16
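
For a handful of specific days that do not form the full configured range, the --target-date form above can be wrapped in a small loop. This is a minimal sketch, not a daeda feature: train_days and the DAEDA override variable are hypothetical names introduced here, and the loop runs days sequentially rather than using --day-workers.

```bash
# Hypothetical wrapper (not part of daeda): run both stages for an explicit
# list of days, one `daeda pipeline train --target-date` call per day.
# DAEDA is an assumed override hook so the loop can be dry-run with DAEDA=echo.
train_days() {
    local service="$1" day
    shift

    # Validate every argument first so a typo cannot leave a half-finished run.
    for day in "$@"; do
        case "$day" in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
            *) echo "not a YYYY-MM-DD day: $day" >&2; return 2 ;;
        esac
    done

    # Stop at the first failing day so the failure is easy to locate.
    for day in "$@"; do
        ${DAEDA:-daeda} pipeline train "$service" --target-date "$day" || return 1
    done
}

# Dry run: print the commands instead of executing them.
#   DAEDA=echo train_days dssm_ranking 2026-05-14 2026-05-15
```

The pattern check only verifies the shape of each argument; it does not reject impossible dates such as 2026-13-40.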

daeda pipeline train flags

| Flag | Default | Description |
| --- | --- | --- |
| SERVICE | (required) | Feature service name (e.g. dssm_ranking) |
| --feature-views-dir | feature_views | Directory of feature view YAML |
| --feature-services-dir | feature_services | Directory of feature service YAML |
| --runtime-config-path | config/training/runtime.yaml | Training runtime config |
| --aggregation-config-path | config/training/aggregations.yaml | Aggregation + embedding-store config |
| --output-root | (from config) | Override the skeleton output root |
| --target-date | (all days) | Narrow a run to a single YYYY-MM-DD day |
| --day-workers | (from config, 1) | Concurrent target days |
| --skeleton-only | off | Run the source stage only |
| --enrich-only | off | Run the enrichment stage only |
| --ray-num-cpus | 16 | Ray logical CPUs for the enrich stage |

The processed month range (feed_start / feed_end) and chunk sizes are config-driven in config/training/runtime.yaml; --target-date narrows a run to a single day.

Output layout

Both stages write under output_root (default data/training_output/, or data/mewtant/training_output/ per the documented production default):

| Path | Written by | Description |
| --- | --- | --- |
| dt=YYYY-MM-DD/part-N-0.parquet | skeleton | Per-day Zstd parquet partitions, sorted by event_timestamp |
| qid_map.parquet | skeleton | Session → incremental integer qid map (grows across runs; only when feed carries feed_session) |
| <root>_enriched/dt=YYYY-MM-DD/*.parquet | enrich | Sharded, chronologically contiguous parquet with the embedding columns appended |

The enrich stage skips days already present in its output directory, so a failed run is safe to re-run.
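
Before re-running, it can help to see which skeleton days still lack enrich output. The sketch below is a rough check under stated assumptions: pending_enrich_days is a hypothetical helper, it assumes <root>_enriched is the output root with _enriched appended, and it treats the mere presence of a dt= directory as "done", which may be looser or stricter than the skip logic the enrich stage actually applies.

```bash
# Hypothetical helper (not part of daeda): list skeleton days that have no
# matching dt= directory in the enrich output.
# Assumes the enrich root is the skeleton output root plus "_enriched".
pending_enrich_days() {
    local root="${1%/}"                # drop a trailing slash, if any
    local enriched="${root}_enriched"
    local dir day

    for dir in "$root"/dt=*/; do
        [ -d "$dir" ] || continue      # glob matched nothing
        day="${dir%/}"                 # strip the trailing slash
        day="${day##*/dt=}"            # keep only YYYY-MM-DD
        [ -d "$enriched/dt=$day" ] || echo "$day"
    done
}

# Usage, with the default output root from this page:
#   pending_enrich_days data/training_output
```

Each printed day can then be passed to daeda pipeline train dssm_ranking --enrich-only --target-date <day>. Because the glob expands in sorted order, the days are printed chronologically.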

Next

  • Skeleton Stage — the SQL source layer: spine build, feature joins, rolling aggregations, per-day output.
  • Enrich Stage — the Pythonic stage: Ray Data streaming actors over a Lance embedding store.