Feature Catalog

The feature catalog is the declarative spine of Daedalus. Feature views are defined as YAML in feature_views/*.yaml and loaded into an in-memory registry by FeatureCatalog.from_yaml_dir() (src/daedalus/catalog/registry.py). Adding a feature is a YAML edit — there is no Python registration step.

Loading views from YAML

FeatureCatalog.from_yaml_dir(definitions_dir) globs every *.yaml in the directory (sorted), parses each through load_feature_view_from_yaml (src/daedalus/catalog/loader.py), and returns a FeatureCatalog holding the shared entity map plus one FeatureViewDef per file.

```python
from pathlib import Path
from daedalus.catalog.registry import FeatureCatalog

catalog = FeatureCatalog.from_yaml_dir(Path("feature_views"))
catalog.list_views()          # all FeatureViewDef objects
catalog.get_view("feed_events")
catalog.list_entities()       # the shared FeatureEntity objects
catalog.to_dict()             # JSON-serializable, for the API / UI
```

The loader keeps the wire-format structs (FeatureViewConfig, SourceConfig, FieldConfig — msgspec.Struct for fast deserialization) separate from the domain model (FeatureViewDef, DataSourceDef, FeatureField, FeatureEntity in src/daedalus/catalog/model.py). The model classes are plain dataclasses with to_dict() for serialization.
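The split between wire format and domain model can be illustrated with a minimal sketch. It is not the project's code: frozen dataclasses stand in for the msgspec.Struct types, the dtype is kept as a string instead of a pyarrow.DataType, and `to_domain` is a hypothetical helper name.

```python
from dataclasses import asdict, dataclass


# Wire format: mirrors the YAML keys one-to-one. The real loader uses
# msgspec.Struct; a frozen dataclass stands in for it in this sketch.
@dataclass(frozen=True)
class FieldConfig:
    name: str
    dtype: str  # raw DuckDB-style string, e.g. "BIGINT[]"


# Domain model: what the rest of the system works with.
@dataclass
class FeatureField:
    name: str
    dtype: str  # assumption: a string stands in for the parsed Arrow type

    def to_dict(self) -> dict:
        return asdict(self)


def to_domain(cfg: FieldConfig) -> FeatureField:
    """Convert one wire-format field into its domain counterpart (hypothetical helper)."""
    # Stand-in normalization; dtype parsing would presumably happen at this boundary.
    return FeatureField(name=cfg.name, dtype=cfg.dtype.strip().upper())


raw = {"name": "aes_score", "dtype": "double"}  # as it might arrive from YAML
field = to_domain(FieldConfig(**raw))
print(field.to_dict())  # {'name': 'aes_score', 'dtype': 'DOUBLE'}
```

Keeping the two layers apart means the YAML schema can change shape without touching the code that consumes FeatureField objects.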

A real feature view

feature_views/feed_events.yaml defines the user–artwork interaction spine (abridged):

```yaml
name: feed_events
entity: [user, artwork]
description: "User-artwork feed interactions (ods_feed_spine_d schema)"
ttl: 90d
online: false
source:
  # Directory path = recursive parquet glob with hive partitioning.
  path: "data/mewtant/feed"
  timestamp_field: event_timestamp
fields:
  - name: user_id
    dtype: BIGINT
  - name: artwork_id
    dtype: BIGINT
  - name: event
    dtype: TEXT
  - name: event_timestamp
    dtype: BIGINT
  - name: aes_score
    dtype: DOUBLE
  - name: tack_ids
    dtype: BIGINT[]      # variable-length list → list<int64>
```

A view with a single entity and a field_mapping (renaming the source's id column to the canonical artwork_id) looks like artwork_vector.yaml:

```yaml
name: artwork_vector
entity: artwork
ttl: 365d
online: true
source:
  path: "data/mewtant/siglip2_vectors"
  timestamp_field: created_at
  field_mapping:
    id: artwork_id
fields:
  - name: media_id
    dtype: BIGINT
```

Dtype parsing → Arrow types

Each field's dtype is a DuckDB-style type string parsed into a pyarrow.DataType by parse_dtype (src/daedalus/catalog/types.py). The catalog is Arrow-native end to end, so the dtype string is the only place DuckDB type names appear in a definition.

parse_dtype handles scalars, parameterized scalars (parameters are stripped), variable-length arrays, fixed-size arrays, and LIST(...) wrappers:

| YAML dtype | Parsed Arrow type |
| --- | --- |
| `FLOAT` | `float32` |
| `DOUBLE` | `float64` |
| `BIGINT` | `int64` |
| `INTEGER` / `INT` | `int32` |
| `VARCHAR` / `TEXT` / `STRING` | `string` |
| `BOOLEAN` | `bool_` |
| `TIMESTAMP` | `timestamp("us")` |
| `VARCHAR(255)` | `string` (parameters stripped) |
| `FLOAT[]` / `DOUBLE[]` | `list_(float32)` / `list_(float64)` |
| `LIST(INT)` | `list_(int32)` |
| `BIGINT[]` | `list_(int64)` |
| `FLOAT[1152]` | `list_(float32, 1152)` (fixed-size list) |

```python
import pyarrow as pa
from daedalus.catalog.types import parse_dtype

parse_dtype("BIGINT")       # DataType(int64)
parse_dtype("FLOAT[]")      # ListType(list<item: float>)
parse_dtype("LIST(INT)")    # ListType(list<item: int32>)
parse_dtype("FLOAT[1152]")  # FixedSizeListType(fixed_size_list<item: float>[1152])
```

Fixed-size arrays vs. variable-length arrays

FLOAT[1152] (fixed-size) is matched before the [] (variable-length) suffix, so a dimensioned suffix is never misread as an unknown type. The dimension must be a bare positive integer: FLOAT[1.5], FLOAT[abc], and a whitespace-only size such as FLOAT[ ] all raise ValueError, whereas the truly empty suffix FLOAT[] is the variable-length form. An unrecognized base type also raises ValueError rather than silently passing through.
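The matching order described here can be sketched as follows. This is an illustration under stated assumptions, not the body of parse_dtype: `describe_dtype` is a hypothetical name, the regular expressions are guesses at one workable grammar, and the result is an Arrow-style type name as a plain string so the sketch runs without pyarrow.

```python
import re

# Base types from the dtype table; values are descriptive strings, not pyarrow objects.
_SCALARS = {
    "FLOAT": "float32", "DOUBLE": "float64", "BIGINT": "int64",
    "INTEGER": "int32", "INT": "int32",
    "VARCHAR": "string", "TEXT": "string", "STRING": "string",
    "BOOLEAN": "bool", "TIMESTAMP": "timestamp[us]",
}

_FIXED = re.compile(r"^(?P<base>.+)\[(?P<dim>[^\[\]]+)\]$")  # BASE[<something>]
_LIST_WRAPPER = re.compile(r"^LIST\((?P<inner>.+)\)$")       # LIST(<inner>)
_PARAMS = re.compile(r"\(.*\)$")                             # trailing (255), (10,2), ...


def describe_dtype(dtype: str) -> str:
    """Return an Arrow-style type name for a DuckDB-style dtype string (sketch)."""
    text = dtype.strip().upper()

    # 1. Dimensioned suffix first: a non-empty bracket body must be a positive integer.
    fixed = _FIXED.match(text)
    if fixed:
        dim = fixed.group("dim")
        if not (dim.isascii() and dim.isdigit()) or int(dim) <= 0:
            raise ValueError(f"invalid array size in {dtype!r}")
        return f"fixed_size_list<{describe_dtype(fixed.group('base'))}>[{int(dim)}]"

    # 2. Empty suffix: variable-length list.
    if text.endswith("[]"):
        return f"list<{describe_dtype(text[:-2])}>"

    # 3. LIST(...) wrapper: also a variable-length list.
    wrapper = _LIST_WRAPPER.match(text)
    if wrapper:
        return f"list<{describe_dtype(wrapper.group('inner'))}>"

    # 4. Scalar: strip any parameters, then require a known base type.
    base = _PARAMS.sub("", text).strip()
    if base not in _SCALARS:
        raise ValueError(f"unknown dtype {dtype!r}")
    return _SCALARS[base]


print(describe_dtype("FLOAT[1152]"))   # fixed_size_list<float32>[1152]
print(describe_dtype("BIGINT[]"))      # list<int64>
print(describe_dtype("VARCHAR(255)"))  # string
```

Checking the dimensioned form first is what lets a malformed size fail loudly instead of falling through to the scalar lookup with a confusing "unknown type" message.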

The reverse direction, dtype_to_str, renders an Arrow type back to a human-readable DuckDB-style string for the API / UI. FixedSizeListType is tested in its own branch, ahead of ListType: in pyarrow it is not a subclass of ListType, so a ListType check alone would never match a fixed-size list.

Entities and join keys

Entities are defined once and shared across all views (the ENTITIES map in registry.py). A view declares entity: <name> or entity: [<name>, ...]; the loader resolves each name against the shared map and attaches the FeatureEntity objects, so a view never re-invents its join keys.

| Entity | Join key | Value type |
| --- | --- | --- |
| `user` | `user_id` | `int64` |
| `artwork` | `artwork_id` | `int64` |

FeatureViewDef.join_keys flattens the join keys of all attached entities, and entity_names lists their names. See the architecture overview for why this is a load-bearing convention.
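A minimal sketch of entity resolution and join-key flattening, assuming simplified stand-ins for the real classes: the field layouts, the `resolve_entities` helper, and the error type are all hypothetical, and only the names FeatureEntity, FeatureViewDef, ENTITIES, join_keys, and entity_names come from the text above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class FeatureEntity:
    name: str
    join_keys: Tuple[str, ...]
    value_type: str = "int64"


# Defined once and shared by every view (mirrors the entity table).
ENTITIES = {
    "user": FeatureEntity("user", ("user_id",)),
    "artwork": FeatureEntity("artwork", ("artwork_id",)),
}


def resolve_entities(spec: Union[str, List[str]]) -> List[FeatureEntity]:
    """Accept `entity: name` or `entity: [a, b]` and look each name up (hypothetical helper)."""
    names = [spec] if isinstance(spec, str) else list(spec)
    unknown = [n for n in names if n not in ENTITIES]
    if unknown:
        raise KeyError(f"unknown entities: {unknown}")
    return [ENTITIES[n] for n in names]


@dataclass
class FeatureViewDef:
    name: str
    entities: List[FeatureEntity]

    @property
    def entity_names(self) -> List[str]:
        return [e.name for e in self.entities]

    @property
    def join_keys(self) -> List[str]:
        # Flatten the join keys of all attached entities, preserving order.
        return [key for e in self.entities for key in e.join_keys]


view = FeatureViewDef("feed_events", resolve_entities(["user", "artwork"]))
print(view.entity_names)  # ['user', 'artwork']
print(view.join_keys)     # ['user_id', 'artwork_id']
```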

DataSourceDef — where the data lives

Every view's source block becomes a DataSourceDef. Beyond the parquet path, a source may attach a real backend (DuckLake / Postgres / Snowflake) and reference a table_name or query:

| Field | Meaning |
| --- | --- |
| `path` | Parquet directory (recursive glob, hive-partitioned) |
| `timestamp_field` | The event-time column (default `event_timestamp`) |
| `field_mapping` | Rename source columns to canonical names (e.g. `id → artwork_id`) |
| `table_name` / `query` | Attached-backend relation or SQL |
| `database_path` | `:memory:` (default), a DuckLake DSN, or `snowflake://…` |
| `data_path`, `catalog_alias`, `s3_*` | DuckLake DATA_PATH, attach alias, object-store credentials |
| `snowflake_*` | Snowflake account / warehouse / key-pair JWT auth |
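The direction of field_mapping (source name on the left, canonical name on the right) is easy to get backwards, so here is a small illustration. `apply_field_mapping` is a hypothetical helper and the column list is illustrative; how the real source applies the mapping is not shown in this section.

```python
def apply_field_mapping(columns, field_mapping):
    """Rename source columns to canonical names; unmapped columns pass through."""
    return [field_mapping.get(col, col) for col in columns]


# Illustrative source columns for the artwork_vector view.
source_columns = ["id", "media_id", "created_at"]
print(apply_field_mapping(source_columns, {"id": "artwork_id"}))
# ['artwork_id', 'media_id', 'created_at']
```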

Secrets are never inlined

Credential fields (s3_secret, snowflake_private_key, …) are supplied as ${ENV_VAR} references, expanded at source-build time. to_dict() masks connection strings and secrets (mask_conn / mask_secret) so the catalog can be serialized for the API / UI without leaking credentials.
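One way the ${ENV_VAR} convention and masking could work is sketched below. Both helpers are hypothetical: `expand_env` and `redact_secret` are illustrative names, and the behavior of the real mask_conn / mask_secret functions may differ.

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ[NAME]; fail loudly if a variable is unset."""
    def _lookup(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return _ENV_REF.sub(_lookup, value)


def redact_secret(value):
    """Hide a secret for serialization; empty or missing values pass through unchanged."""
    return "***" if value else value


os.environ["DEMO_S3_SECRET"] = "hunter2"     # demo value only
config = {"s3_secret": "${DEMO_S3_SECRET}"}  # what the YAML would contain

resolved = expand_env(config["s3_secret"])   # used only when building the source
public = {key: redact_secret(val) for key, val in config.items()}

print(resolved)  # hunter2
print(public)    # {'s3_secret': '***'}
```

Expanding late, at source-build time, keeps the definition files and the serialized catalog free of credential values.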

The polymorphic source backends (parquet, DuckLake, Snowflake, Postgres) are config-only and covered in Data Sources.

Feature services — column-level contracts

Where a view describes a source's columns, a feature service (src/daedalus/catalog/service.py) describes a model's resolved input schema: an ordered, column-level contract over feature-view columns, pipeline-derived columns, and event-spine context columns. A column reference is spelled view:column (FeatureColumnRef.from_string), and each column carries a shape kind (scalar / sequence / embedding), null semantics, and a sensitivity.
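The view:column spelling can be parsed along these lines. This is a sketch only: the class below shares its name with the real FeatureColumnRef but its fields and validation rules are assumptions, and it omits shape kind, null semantics, and sensitivity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureColumnRef:
    view: str
    column: str

    @classmethod
    def from_string(cls, ref: str) -> "FeatureColumnRef":
        """Parse 'view:column', rejecting references with a missing half."""
        view, sep, column = ref.partition(":")
        view, column = view.strip(), column.strip()
        if not sep or not view or not column:
            raise ValueError(f"expected 'view:column', got {ref!r}")
        return cls(view=view, column=column)

    def __str__(self) -> str:
        return f"{self.view}:{self.column}"


ref = FeatureColumnRef.from_string("feed_events:aes_score")
print(ref.view, ref.column)  # feed_events aes_score
```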

The service is what the operator pipeline compiles from: it decides which views to scan, which rolling aggregations and point-in-time lookups to emit, and which embedding columns to enrich. See Operator Pipeline.
