Feature Catalog
The feature catalog is the declarative spine of Daedalus. Feature views are defined as YAML in feature_views/*.yaml and loaded into an in-memory registry by FeatureCatalog.from_yaml_dir() (src/daedalus/catalog/registry.py). Adding a feature is a YAML edit — there is no Python registration step.
Loading views from YAML
FeatureCatalog.from_yaml_dir(definitions_dir) globs every *.yaml in the directory (sorted), parses each through load_feature_view_from_yaml (src/daedalus/catalog/loader.py), and returns a FeatureCatalog holding the shared entity map plus one FeatureViewDef per file.
```python
from pathlib import Path

from daedalus.catalog.registry import FeatureCatalog

catalog = FeatureCatalog.from_yaml_dir(Path("feature_views"))
catalog.list_views()     # all FeatureViewDef objects
catalog.get_view("feed_events")
catalog.list_entities()  # the shared FeatureEntity objects
catalog.to_dict()        # JSON-serializable, for the API / UI
```

The loader keeps the wire-format structs (FeatureViewConfig, SourceConfig, FieldConfig; each a msgspec.Struct for fast deserialization) separate from the domain model (FeatureViewDef, DataSourceDef, FeatureField, FeatureEntity in src/daedalus/catalog/model.py). The model classes are plain dataclasses with to_dict() for serialization.
A real feature view
feature_views/feed_events.yaml defines the user–artwork interaction spine (abridged):
```yaml
name: feed_events
entity: [user, artwork]
description: "User-artwork feed interactions (ods_feed_spine_d schema)"
ttl: 90d
online: false
source:
  # Directory path = recursive parquet glob with hive partitioning.
  path: "data/mewtant/feed"
  timestamp_field: event_timestamp
fields:
  - name: user_id
    dtype: BIGINT
  - name: artwork_id
    dtype: BIGINT
  - name: event
    dtype: TEXT
  - name: event_timestamp
    dtype: BIGINT
  - name: aes_score
    dtype: DOUBLE
  - name: tack_ids
    dtype: BIGINT[]  # variable-length list → list<int64>
```

A view with a single entity and a field_mapping (renaming the source's id column to the canonical artwork_id) looks like artwork_vector.yaml:
```yaml
name: artwork_vector
entity: artwork
ttl: 365d
online: true
source:
  path: "data/mewtant/siglip2_vectors"
  timestamp_field: created_at
  field_mapping:
    id: artwork_id
fields:
  - name: media_id
    dtype: BIGINT
```

Dtype parsing → Arrow types
Each field's dtype is a DuckDB-style type string parsed into a pyarrow.DataType by parse_dtype (src/daedalus/catalog/types.py). The catalog is Arrow-native end to end, so the dtype string is the only place DuckDB type names appear in a definition.
parse_dtype handles scalars, parameterized scalars (parameters are stripped), variable-length arrays, fixed-size arrays, and LIST(...) wrappers:
| YAML dtype | Parsed Arrow type |
|---|---|
| FLOAT | float32 |
| DOUBLE | float64 |
| BIGINT | int64 |
| INTEGER / INT | int32 |
| VARCHAR / TEXT / STRING | string |
| BOOLEAN | bool_ |
| TIMESTAMP | timestamp("us") |
| VARCHAR(255) | string (parameters stripped) |
| FLOAT[] / DOUBLE[] | list_(float32) / list_(float64) |
| LIST(INT) | list_(int32) |
| BIGINT[] | list_(int64) |
| FLOAT[1152] | list_(float32, 1152) (fixed-size list) |
```python
import pyarrow as pa
from daedalus.catalog.types import parse_dtype

parse_dtype("BIGINT")       # DataType(int64)
parse_dtype("FLOAT[]")      # ListType(list<item: float>)
parse_dtype("LIST(INT)")    # ListType(list<item: int32>)
parse_dtype("FLOAT[1152]")  # FixedSizeListType(fixed_size_list<item: float>[1152])
```

Fixed-size arrays vs. variable-length arrays
FLOAT[1152] (fixed-size) is matched before the [] (variable-length) suffix, so a dimensioned suffix is never misread as an unknown type. The dimension must be a bare positive integer: FLOAT[1.5], FLOAT[abc], and a bracket pair containing only whitespace (FLOAT[ ]) all raise ValueError. An unrecognized base type also raises ValueError rather than silently passing through.
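The matching order described above can be sketched with the standard library alone. This is an illustration, not the code in types.py: describe_dtype and its BASE_TYPES table are hypothetical names, it returns plain strings and tuples rather than pyarrow types, and treating a whitespace-only size as an error is an assumption.

```python
import re

# Hypothetical base-type table; the real parse_dtype returns pyarrow types.
BASE_TYPES = {
    "FLOAT": "float32", "DOUBLE": "float64", "BIGINT": "int64",
    "INTEGER": "int32", "INT": "int32",
    "VARCHAR": "string", "TEXT": "string", "STRING": "string",
    "BOOLEAN": "bool", "TIMESTAMP": "timestamp[us]",
}

LIST_WRAPPER = re.compile(r"LIST\((?P<inner>.+)\)", re.IGNORECASE)
FIXED_SIZE = re.compile(r"(?P<base>.+)\[(?P<size>\d+)\]")


def describe_dtype(dtype: str):
    """Return a structural description of a DuckDB-style dtype string."""
    s = dtype.strip()

    # LIST(...) wrapper: recurse into the inner type.
    m = LIST_WRAPPER.fullmatch(s)
    if m:
        return ("list", describe_dtype(m.group("inner")))

    # Fixed-size suffix is tried first, so "[1152]" is never seen as "[]".
    m = FIXED_SIZE.fullmatch(s)
    if m:
        size = int(m.group("size"))
        if size <= 0:
            raise ValueError(f"array size must be positive: {dtype!r}")
        return ("fixed_size_list", describe_dtype(m.group("base")), size)

    # Variable-length suffix.
    if s.endswith("[]"):
        return ("list", describe_dtype(s[:-2]))

    # Any other bracket content (1.5, abc, whitespace) is rejected.
    if "[" in s or "]" in s:
        raise ValueError(f"invalid array size in {dtype!r}")

    # Scalars: strip parameters such as (255), then look up the base name.
    base = re.sub(r"\(.*\)$", "", s).strip().upper()
    if base not in BASE_TYPES:
        raise ValueError(f"unrecognized dtype {dtype!r}")
    return BASE_TYPES[base]


print(describe_dtype("FLOAT[1152]"))  # ('fixed_size_list', 'float32', 1152)
print(describe_dtype("BIGINT[]"))     # ('list', 'int64')
```

Trying the dimensioned pattern first keeps the two array forms unambiguous, and rejecting every other bracket form means a malformed size fails loudly instead of falling through to the scalar lookup.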
The reverse direction — dtype_to_str — renders an Arrow type back to a human-readable DuckDB-style string for the API / UI. FixedSizeListType is tested before ListType because it is not a subclass of it.
Entities and join keys
Entities are defined once and shared across all views (the ENTITIES map in registry.py). A view declares entity: <name> or entity: [<name>, ...]; the loader resolves each name against the shared map and attaches the FeatureEntity objects, so a view never re-invents its join keys.
| Entity | Join key | Value type |
|---|---|---|
| user | user_id | int64 |
| artwork | artwork_id | int64 |
FeatureViewDef.join_keys flattens the join keys of all attached entities, and entity_names lists their names. See the architecture overview for why this is a load-bearing convention.
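The resolution and flattening described above might look roughly like this. The class shapes are assumptions made for illustration (the real FeatureEntity and FeatureViewDef in model.py carry more fields), resolve_entities is a hypothetical helper, and raising on an unknown entity name is an assumed behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntity:
    name: str
    join_keys: tuple   # one or more join-key column names
    value_type: str    # Arrow type name, kept as a string in this sketch


# Shared map, defined once and reused by every view.
ENTITIES = {
    "user": FeatureEntity("user", ("user_id",), "int64"),
    "artwork": FeatureEntity("artwork", ("artwork_id",), "int64"),
}


def resolve_entities(entity):
    """Accept `entity: name` or `entity: [name, ...]`; return FeatureEntity objects."""
    names = [entity] if isinstance(entity, str) else list(entity)
    unknown = [n for n in names if n not in ENTITIES]
    if unknown:
        raise ValueError(f"unknown entities: {unknown}")
    return [ENTITIES[n] for n in names]


@dataclass
class FeatureViewDef:
    name: str
    entities: list

    @property
    def entity_names(self):
        return [e.name for e in self.entities]

    @property
    def join_keys(self):
        # Flatten the join keys of all attached entities, in declaration order.
        return [key for e in self.entities for key in e.join_keys]


view = FeatureViewDef("feed_events", resolve_entities(["user", "artwork"]))
print(view.entity_names)  # ['user', 'artwork']
print(view.join_keys)     # ['user_id', 'artwork_id']
```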
DataSourceDef — where the data lives
Every view's source block becomes a DataSourceDef. Beyond the parquet path, a source may attach a real backend (DuckLake / Postgres / Snowflake) and reference a table_name or query:
| Field | Meaning |
|---|---|
| path | Parquet directory (recursive glob, hive-partitioned) |
| timestamp_field | The event-time column (default event_timestamp) |
| field_mapping | Rename source columns to canonical names (e.g. id → artwork_id) |
| table_name / query | Attached-backend relation or SQL |
| database_path | :memory: (default), a DuckLake DSN, or snowflake://… |
| data_path, catalog_alias, s3_* | DuckLake DATA_PATH, attach alias, object-store credentials |
| snowflake_* | Snowflake account / warehouse / key-pair JWT auth |
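To make the defaults and the field_mapping rename concrete, here is a reduced sketch of a source definition. It covers only a subset of the fields in the table, and the canonical_columns helper is hypothetical, not a method of the real DataSourceDef.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSourceDef:
    # Subset of the fields in the table; defaults follow the table.
    path: Optional[str] = None
    timestamp_field: str = "event_timestamp"
    field_mapping: dict = field(default_factory=dict)
    table_name: Optional[str] = None
    query: Optional[str] = None
    database_path: str = ":memory:"

    def canonical_columns(self, source_columns):
        """Rename mapped source columns; pass unmapped ones through unchanged."""
        return [self.field_mapping.get(c, c) for c in source_columns]


src = DataSourceDef(
    path="data/mewtant/siglip2_vectors",
    timestamp_field="created_at",
    field_mapping={"id": "artwork_id"},
)
print(src.canonical_columns(["id", "media_id", "created_at"]))
# ['artwork_id', 'media_id', 'created_at']
```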
Secrets are never inlined
Credential fields (s3_secret, snowflake_private_key, …) are supplied as ${ENV_VAR} references, expanded at source-build time. to_dict() masks connection strings and secrets (mask_conn / mask_secret) so the catalog can be serialized for the API / UI without leaking credentials.
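One way the expansion and masking could work is sketched below. Only the names mask_conn and mask_secret come from the catalog; their bodies here, the expand_env helper, the DEMO_S3_SECRET variable, and the example DSN are all assumptions made for illustration.

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ[NAME]; fail loudly if it is unset."""
    def _sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return _ENV_REF.sub(_sub, value)


def mask_secret(value):
    """Hide a secret entirely; None stays None so absent fields remain absent."""
    return None if value is None else "***"


def mask_conn(conn: str) -> str:
    """Mask the password in a scheme://user:password@host connection string."""
    return re.sub(r"(://[^:/@]+:)[^@]+@", r"\1***@", conn)


os.environ["DEMO_S3_SECRET"] = "hunter2"   # demo value only
raw = {"s3_secret": "${DEMO_S3_SECRET}"}   # what the YAML would contain
resolved = expand_env(raw["s3_secret"])    # expanded at source-build time

print(mask_secret(resolved))  # ***
print(mask_conn("postgres://app:s3cr3t@db.internal:5432/features"))
# postgres://app:***@db.internal:5432/features
```

Expanding late and masking on serialization means the plaintext value exists only in the process that builds the source, never in the YAML or in the to_dict() output.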
The polymorphic source backends (parquet, DuckLake, Snowflake, Postgres) are config-only and covered in Data Sources.
Feature services — column-level contracts
Where a view describes a source's columns, a feature service (src/daedalus/catalog/service.py) describes a model's resolved input schema: an ordered, column-level contract over feature-view columns, pipeline-derived columns, and event-spine context columns. A column reference is spelled view:column (FeatureColumnRef.from_string), and each column carries a shape kind (scalar / sequence / embedding), null semantics, and a sensitivity.
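The view:column spelling can be illustrated with a small sketch. FeatureColumnRef.from_string is named in the text, but its body here is an assumption, and ShapeKind, ServiceColumn, and views_to_scan are hypothetical names introduced only for this example (null semantics and sensitivity are omitted).

```python
from dataclasses import dataclass
from enum import Enum


class ShapeKind(Enum):
    SCALAR = "scalar"
    SEQUENCE = "sequence"
    EMBEDDING = "embedding"


@dataclass(frozen=True)
class FeatureColumnRef:
    view: str
    column: str

    @classmethod
    def from_string(cls, ref: str) -> "FeatureColumnRef":
        view, sep, column = ref.partition(":")
        if not sep or not view or not column:
            raise ValueError(f"expected 'view:column', got {ref!r}")
        return cls(view, column)


@dataclass(frozen=True)
class ServiceColumn:
    ref: FeatureColumnRef
    shape: ShapeKind


def views_to_scan(columns):
    """Ordered, de-duplicated list of views the service's columns touch."""
    seen = []
    for col in columns:
        if col.ref.view not in seen:
            seen.append(col.ref.view)
    return seen


columns = [
    ServiceColumn(FeatureColumnRef.from_string("feed_events:aes_score"), ShapeKind.SCALAR),
    ServiceColumn(FeatureColumnRef.from_string("feed_events:tack_ids"), ShapeKind.SEQUENCE),
    ServiceColumn(FeatureColumnRef.from_string("artwork_vector:media_id"), ShapeKind.SCALAR),
]
print(views_to_scan(columns))  # ['feed_events', 'artwork_vector']
```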
The service is what the operator pipeline compiles from: it decides which views to scan, which rolling aggregations and point-in-time lookups to emit, and which embedding columns to enrich. See Operator Pipeline.