Data integration in Feast manages how feature data flows between various storage systems. This encompasses reading historical data from data warehouses and batch sources (offline stores), serving low-latency features to online applications (online stores), and materializing features from offline to online storage. The system uses a pluggable provider architecture that abstracts infrastructure-specific implementations, enabling Feast to work across different cloud providers and data platforms.
For information about specific data source types, see Data Sources. For offline store implementations and historical feature retrieval, see Offline Stores. For online store implementations and feature serving, see Online Stores. For the materialization process, see Materialization and Compute Engines.
Feast's data integration architecture separates concerns through abstraction layers, allowing users to swap storage backends without changing application code.
Sources: sdk/python/feast/feature_store.py100-203 sdk/python/feast/infra/provider.py49-63 sdk/python/feast/infra/passthrough_provider.py58-129 sdk/python/feast/repo_config.py193-256
The FeatureStore class serves as the main entry point and delegates operations to a Provider implementation. The default PassthroughProvider orchestrates three types of components: the offline store, the online store, and the batch materialization (compute) engine.
The RepoConfig class (sdk/python/feast/repo_config.py193-256) binds these components together via configuration mappings:
- OFFLINE_STORE_CLASS_FOR_TYPE maps store types to implementation classes (sdk/python/feast/repo_config.py90-106)
- ONLINE_STORE_CLASS_FOR_TYPE maps online store types (sdk/python/feast/repo_config.py68-88)
- BATCH_ENGINE_CLASS_FOR_TYPE maps compute engine types (sdk/python/feast/repo_config.py46-53)

Sources: sdk/python/feast/data_source.py sdk/python/feast/feature_store.py44-50 sdk/python/feast/repo_operations.py145-156
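The sketch below shows how this fits together from a user's perspective, under the assumption of a local repository: instantiating FeatureStore reads feature_store.yaml from the repo path, and the provider, offline store, and online store classes are resolved from the RepoConfig type mappings described above.

```python
from feast import FeatureStore

# Loading the store parses feature_store.yaml from the given repo path; the
# provider and store implementations are resolved via the RepoConfig mappings.
store = FeatureStore(repo_path=".")

print(store.config.provider)            # e.g. "local"
print(store.config.offline_store.type)  # e.g. "file"
print(store.config.online_store.type)   # e.g. "sqlite"
```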
DataSource objects define where feature data originates. Each data source specifies:
- Column mappings such as timestamp_field, created_timestamp_column, and field_mapping

Streaming sources (PushSource, KafkaSource, KinesisSource) also reference a batch_source for historical data, enabling unified feature definitions across streaming and batch pipelines (sdk/python/feast/feature_store.py913-923).
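As a concrete illustration, the hedged sketch below defines a batch FileSource and a PushSource that reuses it as its batch_source; the source names and file path are hypothetical.

```python
from feast import FileSource, PushSource

# Batch source backing the historical side of the feature definitions.
driver_stats_batch = FileSource(
    name="driver_stats_batch",
    path="data/driver_stats.parquet",        # hypothetical path
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Push/streaming source that points at the batch source, so the same feature
# view definition serves both the streaming and batch pipelines.
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_batch,
)
```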
Offline stores handle historical feature retrieval through point-in-time correct SQL joins. They read data from data warehouses, file systems, or compute engines.
Sources: sdk/python/feast/infra/offline_stores/offline_store.py73-295 sdk/python/feast/infra/provider.py248-281
The OfflineStore interface defines how to retrieve historical features:
Key Methods:
- get_historical_features(): Performs point-in-time joins between the entity DataFrame and feature views (sdk/python/feast/infra/offline_stores/bigquery.py234-341)
- pull_latest_from_table_or_query(): Retrieves the latest feature values for materialization (sdk/python/feast/infra/offline_stores/bigquery.py126-184)
- pull_all_from_table_or_query(): Retrieves all feature values in a time range (sdk/python/feast/infra/offline_stores/bigquery.py185-232)
- offline_write_batch(): Writes features back to the offline store (sdk/python/feast/infra/offline_stores/bigquery.py399-451)

All retrieval methods return a RetrievalJob object that lazily executes the query. The job can be converted to pandas, Arrow, or SQL, and supports on-demand feature transformations (sdk/python/feast/infra/offline_stores/offline_store.py140-184).
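A typical call pattern looks like the hedged sketch below, where the feature view name and entity columns are assumptions; the returned RetrievalJob does no work until one of the conversion methods is invoked.

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with per-row event timestamps drive the point-in-time join.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 11:00"]),
    }
)

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # hypothetical feature view:feature
)

training_df = job.to_df()     # executes the query and returns a pandas DataFrame
arrow_table = job.to_arrow()  # alternatively, a pyarrow Table
```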
| Store Type | Class | Data Sources | Key Features |
|---|---|---|---|
| BigQuery | BigQueryOfflineStore | BigQuerySource | Native SQL, GCS staging, point-in-time joins |
| Snowflake | SnowflakeOfflineStore | SnowflakeSource | Native SQL, Snowflake compute |
| Redshift | RedshiftOfflineStore | RedshiftSource | SQL via Data API, S3 staging |
| Spark | SparkOfflineStore | SparkSource | Distributed compute, table/query sources |
| PostgreSQL | PostgreSQLOfflineStore | PostgreSQLSource | Temp tables or embedded queries |
| Trino | TrinoOfflineStore | TrinoSource | Federated queries, multiple backends |
| Dask | DaskOfflineStore | FileSource | Parquet files, local/S3/GCS |
Sources: sdk/python/feast/repo_config.py90-106 sdk/python/feast/infra/offline_stores/bigquery.py125-456 sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py79-402 sdk/python/feast/infra/offline_stores/contrib/postgres_offline_store/postgres.py66-387
Each implementation follows the same pattern:
- Accept RepoConfig and DataSource objects as inputs
- Return RetrievalJob wrappers around query results

For example, BigQuery uses ROW_NUMBER() OVER(PARTITION BY ...) for deduplication (sdk/python/feast/infra/offline_stores/bigquery.py164-176), while Spark uses window functions with the same semantics (sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py129-140).
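To make the deduplication semantics concrete, here is a rough pandas analogue of that window-function pattern; the column names are illustrative, not Feast's internals.

```python
import pandas as pd

def latest_per_entity(df: pd.DataFrame, start, end) -> pd.DataFrame:
    """Keep only the newest row per entity key within [start, end], mirroring
    ROW_NUMBER() OVER (PARTITION BY entity ORDER BY event_timestamp DESC) = 1."""
    in_window = df[(df["event_timestamp"] >= start) & (df["event_timestamp"] <= end)]
    ordered = in_window.sort_values(
        ["driver_id", "event_timestamp", "created"],
        ascending=[True, False, False],
    )
    return ordered.drop_duplicates(subset=["driver_id"], keep="first")
```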
Online stores provide low-latency feature retrieval for real-time inference. They store pre-computed feature values keyed by entity identifiers.
Sources: sdk/python/feast/infra/online_stores/online_store.py33-407 sdk/python/feast/infra/provider.py123-173
The OnlineStore interface defines two categories of operations:
Infrastructure Management:
- plan(): Returns infrastructure objects needed for feature views (sdk/python/feast/infra/online_stores/sqlite.py204-240)
- update(): Creates or modifies tables for feature views (sdk/python/feast/infra/online_stores/sqlite.py242-308)
- teardown(): Removes tables when feature views are deleted (sdk/python/feast/infra/online_stores/sqlite.py310-330)

Data Operations:
- online_write_batch(): Writes feature values with entity keys and timestamps (sdk/python/feast/infra/online_stores/sqlite.py150-202)
- online_read(): Reads feature values by entity keys (sdk/python/feast/infra/passthrough_provider.py225-237)
- get_online_features(): High-level API that resolves feature services and entity rows (sdk/python/feast/infra/online_stores/online_store.py156-240)

| Store Type | Class | Key Features |
|---|---|---|
| SQLite | SqliteOnlineStore | Local development, vector search support (v3.10) |
| Redis | RedisOnlineStore | Production-ready, clustering, TTL support |
| DynamoDB | DynamoDBOnlineStore | AWS-native, on-demand capacity, TTL support |
| PostgreSQL | PostgreSQLOnlineStore | ACID guarantees, vector support via pgvector |
| Cassandra | CassandraOnlineStore | High availability, distributed writes |
| Bigtable | BigtableOnlineStore | GCP-native, low latency at scale |
| Milvus | MilvusOnlineStore | Vector-native, similarity search optimized |
Sources: sdk/python/feast/repo_config.py68-88 sdk/python/feast/infra/online_stores/sqlite.py117-332 sdk/python/feast/infra/online_stores/milvus_online_store/milvus.py86-450
Each online store implements entity key serialization differently. For example:
- SQLite serializes entity keys with serialize_entity_key() (sdk/python/feast/infra/online_stores/sqlite.py166-170)

Online stores also handle feature value encoding using Protocol Buffer ValueProto messages for type safety (sdk/python/feast/infra/online_stores/sqlite.py415-446).
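A minimal sketch of serializing an entity key with serialize_entity_key(), assuming the proto import paths shown and serialization version 2 (in practice the version comes from entity_key_serialization_version in feature_store.yaml):

```python
from feast.infra.key_encoding_utils import serialize_entity_key
from feast.protos.feast.types.EntityKey_pb2 import EntityKey as EntityKeyProto
from feast.protos.feast.types.Value_pb2 import Value as ValueProto

# One entity key: the join key name plus its typed value.
entity_key = EntityKeyProto(
    join_keys=["driver_id"],
    entity_values=[ValueProto(int64_val=1001)],
)

# Bytes used by the online store as the row/primary key.
key_bytes = serialize_entity_key(entity_key, entity_key_serialization_version=2)
```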
Materialization is the process of copying feature values from the offline store to the online store, making them available for low-latency serving.
Sources: sdk/python/feast/feature_store.py1349-1434 sdk/python/feast/infra/provider.py222-246 sdk/python/feast/infra/passthrough_provider.py404-615
The materialization process involves several steps:
1. Feature View Selection: _get_feature_views_to_materialize() identifies which feature views need materialization based on the online=True flag (sdk/python/feast/feature_store.py658-723)
2. Offline Retrieval: pull_latest_from_table_or_query() creates a SQL query that deduplicates features using window functions, keeping only the latest value per entity within the time range (sdk/python/feast/infra/offline_stores/bigquery.py126-184)
3. Batch Processing: The ComputeEngine reads data in batches to avoid memory issues. The default batch size is 10,000 rows (sdk/python/feast/infra/passthrough_provider.py55)
4. Entity Key Serialization: Entity values are serialized into EntityKeyProto messages using serialize_entity_key() (sdk/python/feast/infra/key_encoding_utils.py)
5. Online Write: online_write_batch() writes tuples of (EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]) to the online store (sdk/python/feast/infra/provider.py123-147)
6. Metadata Update: The registry tracks the last successful materialization timestamp per feature view (sdk/python/feast/feature_store.py1420-1422)
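From the SDK, the whole flow is triggered with materialize() or materialize_incremental(); the sketch below assumes a local repository and a seven-day backfill window.

```python
from datetime import datetime, timedelta
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Backfill a fixed window from the offline store into the online store.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
)

# Or pick up from the last recorded materialization timestamp per feature view.
store.materialize_incremental(end_date=datetime.utcnow())
```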
Sources: sdk/python/feast/feature_store.py1672-1876 sdk/python/feast/infra/offline_stores/bigquery.py234-341 sdk/python/feast/infra/offline_stores/offline_utils.py74-399
Point-in-time joins ensure training data correctness by joining entity rows with feature values that existed at the exact timestamp of each row:
1. Entity DataFrame Upload: The entity DataFrame is uploaded to a temporary table in the offline store (sdk/python/feast/infra/offline_stores/bigquery.py286-290)
2. Timestamp Inference: If not explicitly provided, Feast infers the event timestamp column by looking for datetime columns (sdk/python/feast/infra/offline_stores/offline_utils.py28-44)
3. Query Context Building: get_feature_view_query_context() creates a context object for each feature view containing join keys, feature names, and timestamp bounds (sdk/python/feast/infra/offline_stores/offline_utils.py193-399)
4. SQL Generation: build_point_in_time_query() generates SQL with LEFT JOIN clauses that include timestamp predicates ensuring no future data leakage (sdk/python/feast/infra/offline_stores/offline_utils.py462-1081)
5. Result Materialization: The RetrievalJob executes the query lazily when to_df() or to_arrow() is called (sdk/python/feast/infra/offline_stores/offline_store.py76-95)
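The join semantics can be pictured with a pandas merge_asof; this is only a conceptual analogue of the generated SQL, not Feast's implementation, and the column names are illustrative.

```python
import pandas as pd

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001],
        "event_timestamp": pd.to_datetime(["2024-05-01", "2024-05-03"]),
    }
)
feature_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001],
        "event_timestamp": pd.to_datetime(["2024-04-30", "2024-05-02"]),
        "conv_rate": [0.1, 0.2],
    }
)

# Each entity row picks up the most recent feature value at or before its own
# event_timestamp, so future values can never leak into the training set.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
```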
Sources: sdk/python/feast/feature_store.py1878-2047 sdk/python/feast/infra/online_stores/online_store.py156-240 sdk/python/feast/on_demand_feature_view.py
Online feature retrieval is optimized for low latency:
1. Entity Key Serialization: Entity dictionaries are converted to EntityKeyProto using the configured serialization version (sdk/python/feast/feature_store.py238-245)
2. Batch Read: online_read() retrieves features for multiple entities in a single operation (sdk/python/feast/infra/online_stores/sqlite.py332-413)
3. Feature Assembly: Results are assembled into an OnlineResponse object containing field values and metadata (sdk/python/feast/online_response.py)
4. On-Demand Transformations: If requested features include OnDemandFeatureView definitions, transformations are applied in-process before returning results (sdk/python/feast/on_demand_feature_view.py)
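A typical online lookup, with hypothetical feature and entity names, looks like this:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

response = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",  # hypothetical feature view:feature
        "driver_hourly_stats:acc_rate",
    ],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
)

# OnlineResponse exposes the assembled feature vectors as a plain dict.
feature_vectors = response.to_dict()
```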
Data integration is configured through the feature_store.yaml file, which specifies offline store, online store, and provider settings.
Sources: sdk/python/feast/repo_config.py193-557
Key Configuration Fields:
| Field | Purpose | Options |
|---|---|---|
| provider | Infrastructure orchestration | local, gcp, aws, azure (all use PassthroughProvider) |
| offline_store.type | Offline store implementation | file, bigquery, snowflake.offline, redshift, spark, postgres, trino, dask |
| online_store.type | Online store implementation | sqlite, redis, dynamodb, postgres, cassandra, bigtable, milvus |
| batch_engine.type | Materialization compute | local, snowflake.engine, lambda, k8s, spark.engine |
| registry.type | Metadata persistence | file, sql, snowflake.registry, remote |
The configuration system uses dynamic class loading via get_offline_config_from_type() and get_online_config_from_type() functions (sdk/python/feast/repo_config.py630-636 sdk/python/feast/repo_config.py600-608). This allows third-party implementations by specifying fully-qualified class names.
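The underlying idea can be sketched as follows; the helper name is hypothetical and the snippet only illustrates the mechanism, not Feast's actual loader.

```python
import importlib

def load_store_class(fully_qualified_name: str):
    """Resolve a 'package.module.ClassName' string from feature_store.yaml
    into the class object it names (hypothetical helper, for illustration)."""
    module_name, class_name = fully_qualified_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# e.g. load_store_class("my_company.feast_ext.CustomOfflineStore") for a
# third-party store; a stdlib class is used here so the sketch runs as-is.
JsonDecoder = load_store_class("json.JSONDecoder")
```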
Data sources are defined in Python code rather than YAML:
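A minimal example of such a definition, with hypothetical table and column names, might look like:

```python
from feast import BigQuerySource

driver_stats_source = BigQuerySource(
    name="driver_hourly_stats",
    table="my_project.feast.driver_hourly_stats",    # hypothetical table
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    field_mapping={"conversion_rate": "conv_rate"},  # rename source column -> feature name
)
```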
Sources: tests/example_repos/example_feature_repo_1.py14-18 sdk/python/feast/data_source.py
Data sources support several configuration options:
- timestamp_field (event time) and created_timestamp_column (write time)
- Column renaming via a field_mapping dict

The provider is automatically determined by the provider field in feature_store.yaml. All standard providers (local, gcp, aws, azure) use PassthroughProvider (sdk/python/feast/infra/provider.py41-46), which delegates to the configured stores.
Sources: sdk/python/feast/infra/provider.py533-546
Custom providers can be implemented by extending the Provider abstract class and specifying the fully-qualified class name as the provider value.
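A minimal custom provider might simply subclass PassthroughProvider to reuse its orchestration and override only selected hooks, then be referenced from feature_store.yaml by its fully-qualified name; the class and module names below are hypothetical.

```python
from feast.infra.passthrough_provider import PassthroughProvider

class MyProvider(PassthroughProvider):
    """Hypothetical custom provider; override only the methods whose behavior
    needs to change. Referenced in feature_store.yaml as, for example:
    provider: my_company.feast_ext.MyProvider"""
    pass
```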