Getting Started
This page walks you from a fresh checkout to your first training run.
Prerequisites
- Python 3.13+: the project requires >=3.13 (pinned in .python-version at the repo root).
- uv: the package and project manager used for every command below.
Install
Install the project and its dev dependencies (pytest, ruff):
uv sync --dev
This installs the core engine. Both optional backends import lazily, so a plain import daedalus never requires them; install an extra only when you actually use that backend.
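The lazy-import behavior described above can be illustrated with a minimal sketch. This is not Daedalus's actual code: load_optional and open_snowflake_source are hypothetical names, and the error message wording is an assumption. The point is only that the optional package is imported when a backend is first used, not when the top-level package is imported.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, not at package import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Point the user at the extra that provides the missing package.
        raise ImportError(
            f"{module_name!r} is not installed; "
            f"run `uv sync --extra {extra}` to enable it"
        ) from exc


def open_snowflake_source() -> ModuleType:
    # Hypothetical call site: the import happens only when this function runs,
    # so importing the surrounding package never touches the connector.
    return load_optional("snowflake.connector", extra="snowflake")
```

With this pattern, a missing extra surfaces as a clear error at the moment the backend is requested rather than at import time.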
Optional extras
# Unlocks catalog.table.SnowflakeSource:
# key-pair JWT auth + Arrow query pushdown
uv sync --extra snowflake
# Unlocks the MLflowTracker (mlflow-skinny) projection of training runs
uv sync --extra mlflow
- snowflake: pulls in snowflake-connector-python + cryptography so a data source can be backed by Snowflake (snowflake://<account> with key-pair JWT auth). Without it, the Snowflake source stays unimportable but the rest of Daedalus works.
- mlflow: pulls in mlflow-skinny so each TrainingRunRecord (feature service @ version, git SHA, config paths) can be projected to MLflow. Tracking is off by default (a no-op NullTracker); enable it by setting TrainingRunConfig.tracking.kind="mlflow".
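To make the "off by default" tracking behavior concrete, here is a rough sketch of the null-object pattern that a no-op tracker usually follows. Everything in it is illustrative: the Tracker protocol, the log_run method, the "null" and "memory" kind names, and make_tracker are assumptions, and RecordingTracker merely stands in for a real backend such as MLflow.

```python
from typing import Protocol


class Tracker(Protocol):
    def log_run(self, record: dict) -> None: ...


class NullTracker:
    """Default tracker: accepts every record and does nothing with it."""

    def log_run(self, record: dict) -> None:
        return None


class RecordingTracker:
    """Stand-in for a real backend; keeps records in memory."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def log_run(self, record: dict) -> None:
        self.records.append(record)


def make_tracker(kind: str = "null") -> Tracker:
    # Hypothetical factory keyed on a config value such as tracking.kind.
    if kind == "null":
        return NullTracker()
    if kind == "memory":
        return RecordingTracker()
    raise ValueError(f"unknown tracker kind: {kind!r}")
```

Because callers always hold some tracker, the training code never needs an "is tracking enabled?" branch; switching backends is a config change only.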
Link your data
Feature sources expect data under data/mewtant/ — typically a symlink to external storage. The directory holds one subdir per feature group: aesthetic/, feed/, generation/, likes/, properties/, siglip2_vectors/, tack/, and user/.
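A quick way to see which feature groups are present under the data root is a small check like the one below. It is a sketch, not part of the daeda CLI: missing_feature_groups is a hypothetical helper, and the group names are copied from the list above. A missing group is not necessarily an error if your services do not use it.

```python
from pathlib import Path

# Subdirectory names taken from the layout described above.
FEATURE_GROUPS = (
    "aesthetic",
    "feed",
    "generation",
    "likes",
    "properties",
    "siglip2_vectors",
    "tack",
    "user",
)


def missing_feature_groups(root: Path) -> list[str]:
    """Return the expected feature-group subdirs that do not exist under root."""
    # Path.is_dir() follows symlinks, so a symlinked root or subdir counts.
    return [name for name in FEATURE_GROUPS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_feature_groups(Path("data/mewtant"))
    print("missing:", ", ".join(missing) or "none")
```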
First, download the monthly feature parquets from S3 with daeda download (a YYYYMM range, comma-separated feature list, and an output directory):
daeda download \
--features properties,feed,generation,siglip2_vectors \
--start 202601 \
--end 202605 \
--output-dir /path/to/spacious_storage
Preview before you pull
Add --dry-run to print the generated sync commands plus a per-feature coverage report (present / missing / unknown prefixes) without downloading anything. daeda download autodetects the transfer backend, preferring s5cmd and falling back to aws s3 sync; force one with --backend s5cmd|aws.
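The two pieces of logic mentioned here, expanding a YYYYMM range and choosing a transfer backend, can be sketched as follows. This is an illustration under assumptions, not the code behind daeda download: month_range and pick_backend are hypothetical names, the end month is assumed to be inclusive, and backend detection is assumed to be a simple PATH lookup.

```python
import shutil
from typing import Callable, Optional


def month_range(start: str, end: str) -> list[str]:
    """Expand two YYYYMM strings into every month between them (end inclusive)."""
    year, month = int(start[:4]), int(start[4:])
    end_year, end_month = int(end[:4]), int(end[4:])
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}{month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def pick_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Prefer s5cmd when it is on PATH, otherwise fall back to the AWS CLI."""
    if which("s5cmd"):
        return "s5cmd"
    if which("aws"):
        return "aws"
    raise RuntimeError("neither s5cmd nor aws was found on PATH")
```

Passing the lookup function as a parameter keeps the selection logic testable without either tool installed.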
Then symlink the downloaded directory into the project so the feature definitions can find it:
unlink data/mewtant
ln -sf /path/to/spacious_storage data/mewtant
First commands
Inspect the feature catalog
Start by browsing what features exist — no Python required:
# List every feature view with a one-line summary
uv run daeda catalog list
# Print a view's metadata + schema as JSON
uv run daeda catalog show user_profile
Run a training pipeline
Training is a single canonical path (Core v1): daeda pipeline train <service> compiles the feature service into an operator DAG and runs the unified skeleton → enrich pipeline.
# Full run over the month range configured in runtime.yaml
uv run daeda pipeline train dssm_ranking
Useful flags for iterating:
# Narrow to a single day (YYYY-MM-DD)
uv run daeda pipeline train dssm_ranking --target-date 2026-05-15
# Run just one stage
uv run daeda pipeline train dssm_ranking --skeleton-only --target-date 2026-05-15
uv run daeda pipeline train dssm_ranking --enrich-only --target-date 2026-05-15
The processed month range (feed_start / feed_end) and other knobs are config-driven; see Configuration. The available services are dssm_ranking (DSSM retrieval) and xgb_reranker (XGBoost reranker).
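As a mental model for how the flags above combine, the following argparse sketch mirrors the documented options. It is not the daeda CLI's implementation: the parser, the stages_to_run helper, and the choice to treat --skeleton-only and --enrich-only as mutually exclusive are all assumptions made for illustration.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="train-sketch")
    parser.add_argument("service", choices=["dssm_ranking", "xgb_reranker"])
    # date.fromisoformat raises ValueError on malformed input, which argparse
    # reports as a usage error.
    parser.add_argument("--target-date", type=date.fromisoformat, default=None)
    # Assumption: asking for only one stage twice over makes no sense,
    # so the two flags are modeled as mutually exclusive.
    stage = parser.add_mutually_exclusive_group()
    stage.add_argument("--skeleton-only", action="store_true")
    stage.add_argument("--enrich-only", action="store_true")
    return parser


def stages_to_run(args: argparse.Namespace) -> list[str]:
    """Map the stage flags to the ordered list of stages to execute."""
    if args.skeleton_only:
        return ["skeleton"]
    if args.enrich_only:
        return ["enrich"]
    return ["skeleton", "enrich"]
```

With no stage flag, both stages run in order; with no --target-date, the sketch leaves the date unset, standing in for the configured month range.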
Enrich needs the Lance store
The enrich stage reads avg-pooled embeddings from a Lance store (data/store/v6/artwork_embedding by default). It is built automatically on first run from data/mewtant/siglip2_vectors, and must reside on a local ext4 filesystem — GPFS / network mounts are not supported (Lance atomic-rename limitation).
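Before the first enrich run, it can help to confirm which filesystem the store directory lives on. The sketch below is a Linux-only pre-flight check, not something Daedalus ships: filesystem_type and is_local_ext4 are hypothetical helpers that parse /proc/mounts, and mount points containing escaped spaces are not handled.

```python
from pathlib import Path
from typing import Optional


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the fs type of the longest mount point that contains path."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts columns: device, mount point, fs type, options, ...
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same mount point win.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_local_ext4(store_dir: str) -> bool:
    # Resolve symlinks first so the check sees the real storage location.
    resolved = str(Path(store_dir).resolve())
    mounts_text = Path("/proc/mounts").read_text()
    return filesystem_type(resolved, mounts_text) == "ext4"
```

For example, is_local_ext4("data/store/v6/artwork_embedding") would return False when the resolved path sits on a GPFS or NFS mount.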
Running tests
Run the full suite:
uv run pytest
Tests are auto-marked by directory (see conftest.py). Run a subset by marker:
uv run pytest -m unit # fast unit tests
uv run pytest -m integration # integration tests (slow)
uv run pytest -m cli # CLI behavior tests
uv run pytest -m "unit and not slow"
Or target a single file or test:
uv run pytest tests/catalog/test_loader.py
uv run pytest tests/catalog/test_loader.py::test_load_feature_view_from_yaml
Lint and format with ruff (line length 88):
uv run ruff check .
uv run ruff format .
Next steps
- Configuration — the layered YAML config model and the agent CLI config.
- Architecture Overview — how the catalog, engine, and pipeline fit together.