Feature Catalog

The feature catalog is the declarative spine of Daedalus. Feature views are defined as YAML in feature_views/*.yaml and loaded into an in-memory registry by FeatureCatalog.from_yaml_dir() (src/daedalus/catalog/registry.py). Adding a feature is a YAML edit — there is no Python registration step.

Loading views from YAML

FeatureCatalog.from_yaml_dir(definitions_dir) globs every *.yaml in the directory (sorted), parses each through load_feature_view_from_yaml (src/daedalus/catalog/loader.py), and returns a FeatureCatalog holding the shared entity map plus one FeatureViewDef per file.

python
from pathlib import Path
from daedalus.catalog.registry import FeatureCatalog

catalog = FeatureCatalog.from_yaml_dir(Path("feature_views"))
catalog.list_views()          # all FeatureViewDef objects
catalog.get_view("feed_events")
catalog.list_entities()       # the shared FeatureEntity objects
catalog.to_dict()             # JSON-serializable, for the API / UI

The loader keeps the wire-format structs (FeatureViewConfig, SourceConfig, and FieldConfig, all msgspec.Struct types for fast deserialization) separate from the domain model (FeatureViewDef, DataSourceDef, FeatureField, FeatureEntity in src/daedalus/catalog/model.py). The model classes are plain dataclasses with to_dict() for serialization.
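The split between wire format and domain model can be illustrated with a small sketch. Everything here is a stand-in: a plain dict plays the role of the decoded wire-format struct (the real loader uses msgspec.Struct types), and FieldSketch, ViewSketch, and to_domain are hypothetical names, not the classes in model.py.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSketch:
    """Hypothetical stand-in for a domain-model field."""
    name: str
    dtype: str

    def to_dict(self) -> dict[str, Any]:
        return {"name": self.name, "dtype": self.dtype}


@dataclass(frozen=True)
class ViewSketch:
    """Hypothetical stand-in for a domain-model feature view."""
    name: str
    entities: tuple[str, ...]
    fields: tuple[FieldSketch, ...]

    def to_dict(self) -> dict[str, Any]:
        # Only JSON-friendly builtins leave the domain model.
        return {
            "name": self.name,
            "entities": list(self.entities),
            "fields": [f.to_dict() for f in self.fields],
        }


def to_domain(wire: dict[str, Any]) -> ViewSketch:
    """Convert a decoded wire record (a dict here) into the domain model."""
    raw_entity = wire["entity"]
    # `entity` may be a single name or a list of names.
    entities = (raw_entity,) if isinstance(raw_entity, str) else tuple(raw_entity)
    fields = tuple(
        FieldSketch(name=f["name"], dtype=f["dtype"]) for f in wire.get("fields", [])
    )
    return ViewSketch(name=wire["name"], entities=entities, fields=fields)
```

Keeping the conversion in one function means the rest of the code only ever sees the domain dataclasses, whatever the on-disk format looks like.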

A real feature view

feature_views/feed_events.yaml defines the user–artwork interaction spine (abridged):

yaml
name: feed_events
entity: [user, artwork]
description: "User-artwork feed interactions (ods_feed_spine_d schema)"
ttl: 90d
online: false
source:
  # Directory path = recursive parquet glob with hive partitioning.
  path: "data/mewtant/feed"
  timestamp_field: event_timestamp
fields:
  - name: user_id
    dtype: BIGINT
  - name: artwork_id
    dtype: BIGINT
  - name: event
    dtype: TEXT
  - name: event_timestamp
    dtype: BIGINT
  - name: aes_score
    dtype: DOUBLE
  - name: tack_ids
    dtype: BIGINT[]      # variable-length list → list<int64>

A view with a single entity and a field_mapping (renaming the source's id column to the canonical artwork_id) looks like artwork_vector.yaml:

yaml
name: artwork_vector
entity: artwork
ttl: 365d
online: true
source:
  path: "data/mewtant/siglip2_vectors"
  timestamp_field: created_at
  field_mapping:
    id: artwork_id
fields:
  - name: media_id
    dtype: BIGINT

Dtype parsing → Arrow types

Each field's dtype is a DuckDB-style type string parsed into a pyarrow.DataType by parse_dtype (src/daedalus/catalog/types.py). The catalog is Arrow-native end to end, so the dtype string is the only place DuckDB type names appear in a definition.

parse_dtype handles scalars, parameterized scalars (parameters are stripped), variable-length arrays, fixed-size arrays, and LIST(...) wrappers:

| YAML dtype | Parsed Arrow type |
| --- | --- |
| FLOAT | float32 |
| DOUBLE | float64 |
| BIGINT | int64 |
| INTEGER / INT | int32 |
| VARCHAR / TEXT / STRING | string |
| BOOLEAN | bool_ |
| TIMESTAMP | timestamp("us") |
| VARCHAR(255) | string (parameters stripped) |
| FLOAT[] / DOUBLE[] | list_(float32) / list_(float64) |
| LIST(INT) | list_(int32) |
| BIGINT[] | list_(int64) |
| FLOAT[1152] | list_(float32, 1152) (fixed-size list) |

python
import pyarrow as pa
from daedalus.catalog.types import parse_dtype

parse_dtype("BIGINT")       # DataType(int64)
parse_dtype("FLOAT[]")      # ListType(list<item: float>)
parse_dtype("LIST(INT)")    # ListType(list<item: int32>)
parse_dtype("FLOAT[1152]")  # FixedSizeListType(fixed_size_list<item: float>[1152])

Fixed-size arrays vs. variable-length arrays

FLOAT[1152] (fixed-size) is matched before the [] (variable-length) suffix, so a dimensioned suffix is never misread as an unknown type. The dimension must be a bare positive integer: FLOAT[1.5], FLOAT[abc], and FLOAT[ ] (brackets holding only whitespace) all raise ValueError. An unrecognized base type also raises ValueError rather than silently passing through.
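The matching order and validation rules can be sketched without pyarrow. This is not the code in types.py: describe_dtype is a hypothetical function that returns a tagged tuple instead of a pyarrow.DataType, and treating a whitespace-only dimension as an error is an assumption about the blank-size case.

```python
import re

# Base names from the dtype table; values are Arrow type names as strings.
_SCALARS = {
    "FLOAT": "float32", "DOUBLE": "float64", "BIGINT": "int64",
    "INTEGER": "int32", "INT": "int32",
    "VARCHAR": "string", "TEXT": "string", "STRING": "string",
    "BOOLEAN": "bool", "TIMESTAMP": "timestamp[us]",
}

_LIST_WRAPPER = re.compile(r"LIST\((.+)\)")
_DIMENSIONED = re.compile(r"(.+)\[([^\[\]]+)\]")   # non-empty bracket content
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")
_PARAMS = re.compile(r"\(.*\)$")


def describe_dtype(dtype: str) -> tuple:
    """Parse a DuckDB-style dtype string into a tagged tuple (sketch)."""
    s = dtype.strip().upper()

    m = _LIST_WRAPPER.fullmatch(s)
    if m:
        return ("list", describe_dtype(m.group(1)))

    # Fixed-size suffix is tried before the bare "[]" suffix.
    m = _DIMENSIONED.fullmatch(s)
    if m:
        base, size = m.group(1), m.group(2)
        if not _POSITIVE_INT.fullmatch(size):
            raise ValueError(f"invalid array dimension {size!r} in {dtype!r}")
        return ("fixed_size_list", describe_dtype(base), int(size))

    if s.endswith("[]"):
        return ("list", describe_dtype(s[:-2]))

    # Parameterized scalar: VARCHAR(255) -> VARCHAR
    base = _PARAMS.sub("", s).strip()
    if base not in _SCALARS:
        raise ValueError(f"unrecognized dtype {dtype!r}")
    return ("scalar", _SCALARS[base])
```

Because the dimensioned pattern requires at least one character between the brackets, FLOAT[] falls through to the variable-length branch, while FLOAT[1.5] and FLOAT[abc] are rejected at the dimension check.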

The reverse direction, dtype_to_str, renders an Arrow type back to a human-readable DuckDB-style string for the API / UI. FixedSizeListType is tested explicitly, and before ListType, because it is not a subclass of ListType and so needs its own branch.
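A dependency-free sketch of that rendering order follows. Scalar, ListOf, and FixedListOf are hypothetical stand-ins for Arrow's type classes, and the FLOAT[1152] output spelling is assumed to mirror the input syntax; the actual output of dtype_to_str is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    name: str            # DuckDB-style name, e.g. "BIGINT"


@dataclass(frozen=True)
class ListOf:
    item: object         # element type


@dataclass(frozen=True)
class FixedListOf:       # deliberately not a subclass of ListOf
    item: object
    size: int


def render_dtype(t: object) -> str:
    """Render a stand-in type back to a DuckDB-style string (sketch)."""
    # The fixed-size branch comes first and is independent of the list branch.
    if isinstance(t, FixedListOf):
        return f"{render_dtype(t.item)}[{t.size}]"
    if isinstance(t, ListOf):
        return f"{render_dtype(t.item)}[]"
    if isinstance(t, Scalar):
        return t.name
    raise ValueError(f"cannot render {t!r}")
```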

Entities and join keys

Entities are defined once and shared across all views (the ENTITIES map in registry.py). A view declares entity: <name> or entity: [<name>, ...]; the loader resolves each name against the shared map and attaches the FeatureEntity objects, so a view never re-invents its join keys.

| Entity | Join key | Value type |
| --- | --- | --- |
| user | user_id | int64 |
| artwork | artwork_id | int64 |

FeatureViewDef.join_keys flattens the join keys of all attached entities, and entity_names lists their names. See the architecture overview for why this is a load-bearing convention.
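As a rough sketch of this resolution step: EntitySketch, ENTITIES_SKETCH, resolve_entities, and flatten_join_keys below are hypothetical names standing in for FeatureEntity, the ENTITIES map, and the join_keys property. Raising KeyError for an unknown entity name is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySketch:
    name: str
    join_keys: tuple[str, ...]
    value_type: str


# Shared map, defined once; the entries mirror the table above.
ENTITIES_SKETCH = {
    "user": EntitySketch("user", ("user_id",), "int64"),
    "artwork": EntitySketch("artwork", ("artwork_id",), "int64"),
}


def resolve_entities(entity) -> list[EntitySketch]:
    """Accept `entity: name` or `entity: [name, ...]` and look each name up."""
    names = [entity] if isinstance(entity, str) else list(entity)
    unknown = [n for n in names if n not in ENTITIES_SKETCH]
    if unknown:
        raise KeyError(f"unknown entities: {unknown}")
    return [ENTITIES_SKETCH[n] for n in names]


def flatten_join_keys(entities: list[EntitySketch]) -> list[str]:
    """Concatenate the join keys of all attached entities, in order."""
    return [key for e in entities for key in e.join_keys]
```

For the feed_events view, resolving [user, artwork] and flattening yields user_id followed by artwork_id.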

DataSourceDef — where the data lives

Every view's source block becomes a DataSourceDef. Beyond the parquet path, a source may attach a real backend (DuckLake / Postgres / Snowflake) and reference a table_name or query:

| Field | Meaning |
| --- | --- |
| path | Parquet directory (recursive glob, hive-partitioned) |
| timestamp_field | The event-time column (default event_timestamp) |
| field_mapping | Rename source columns to canonical names (e.g. id → artwork_id) |
| table_name / query | Attached-backend relation or SQL |
| database_path | :memory: (default), a DuckLake DSN, or snowflake://… |
| data_path, catalog_alias, s3_* | DuckLake DATA_PATH, attach alias, object-store credentials |
| snowflake_* | Snowflake account / warehouse / key-pair JWT auth |
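To make the field_mapping semantics concrete, here is a minimal sketch that renames the keys of a single row. How Daedalus actually applies the mapping (for example at the SQL or Arrow level) is not shown here; apply_field_mapping is a hypothetical helper, and rejecting a rename that collides with an existing column is an assumption.

```python
def apply_field_mapping(row: dict, field_mapping: dict) -> dict:
    """Rename source columns to canonical names; unmapped columns pass through."""
    renamed = {}
    for source_name, value in row.items():
        canonical = field_mapping.get(source_name, source_name)
        if canonical in renamed:
            # Assumption: two columns landing on one name is a config error.
            raise ValueError(f"field_mapping produces duplicate column {canonical!r}")
        renamed[canonical] = value
    return renamed
```

With the mapping from artwork_vector.yaml, a source row keyed by id comes out keyed by artwork_id, and media_id is left untouched.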

Secrets are never inlined

Credential fields (s3_secret, snowflake_private_key, …) are supplied as ${ENV_VAR} references, expanded at source-build time. to_dict() masks connection strings and secrets (mask_conn / mask_secret) so the catalog can be serialized for the API / UI without leaking credentials.
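The expand-then-mask pattern might look like the sketch below. It is not the Daedalus implementation: expand_env and mask_secret_sketch are hypothetical helpers, failing on an unset variable is an assumption, and the real mask_secret may render masked values differently.

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=None) -> str:
    """Replace each ${NAME} reference with the value of that variable."""
    env = os.environ if env is None else env

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: a missing variable is an error, not an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(_substitute, value)


def mask_secret_sketch(value):
    """Hide a secret for serialization; empty or missing values pass through."""
    return "***" if value else value
```

Expansion happens only when the source is built, so the definition on disk and the serialized catalog both carry the reference or the mask, never the secret itself.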

The polymorphic source backends (parquet, DuckLake, Snowflake, Postgres) are config-only and covered in Data Sources.

Feature services — column-level contracts

Where a view describes a source's columns, a feature service (src/daedalus/catalog/service.py) describes a model's resolved input schema: an ordered, column-level contract over feature-view columns, pipeline-derived columns, and event-spine context columns. A column reference is spelled view:column (FeatureColumnRef.from_string), and each column carries a shape kind (scalar / sequence / embedding), null semantics, and a sensitivity.
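A view:column reference can be parsed with a few lines of code. The sketch below is an illustration only: ColumnRefSketch is a hypothetical stand-in for FeatureColumnRef, and its validation rules (exactly one non-empty view and one non-empty column) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRefSketch:
    view: str
    column: str

    @classmethod
    def from_string(cls, ref: str) -> "ColumnRefSketch":
        """Parse 'view:column' into its two parts."""
        view, sep, column = ref.partition(":")
        if not sep or not view or not column or ":" in column:
            raise ValueError(f"expected 'view:column', got {ref!r}")
        return cls(view=view, column=column)

    def __str__(self) -> str:
        return f"{self.view}:{self.column}"
```

For example, "feed_events:aes_score" parses to view feed_events and column aes_score, and str() of the result gives the original spelling back.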

The service is what the operator pipeline compiles from: it decides which views to scan, which rolling aggregations and point-in-time lookups to emit, and which embedding columns to enrich. See Operator Pipeline.
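The first of those decisions, which views to scan, can be approximated by grouping a service's column references by view. This sketch assumes every reference is spelled view:column and ignores pipeline-derived and event-spine columns; columns_by_view is a hypothetical helper, not part of service.py.

```python
def columns_by_view(refs: list[str]) -> dict[str, list[str]]:
    """Group 'view:column' references by view, preserving first-seen order."""
    grouped: dict[str, list[str]] = {}
    for ref in refs:
        view, sep, column = ref.partition(":")
        if not sep or not view or not column:
            raise ValueError(f"expected 'view:column', got {ref!r}")
        columns = grouped.setdefault(view, [])
        if column not in columns:   # a column is scanned once per view
            columns.append(column)
    return grouped
```

The keys of the result are the views that need scanning, and each value is the column projection for that view.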