Data integration in Feast manages how feature data flows between various storage systems. This encompasses reading historical data from data warehouses and batch sources (offline stores), serving low-latency features to online applications (online stores), and materializing features from offline to online storage. The system uses a pluggable provider architecture that abstracts infrastructure-specific implementations, enabling Feast to work across different cloud providers and data platforms.
For information about specific data source types, see Data Sources. For offline store implementations and historical feature retrieval, see Offline Stores. For online store implementations and feature serving, see Online Stores. For the materialization process, see Materialization and Compute Engines.
Feast's data integration architecture separates concerns through abstraction layers, allowing users to swap storage backends without changing application code.
Sources: sdk/python/feast/feature_store.py100-203 sdk/python/feast/infra/provider.py49-63 sdk/python/feast/infra/passthrough_provider.py58-129 sdk/python/feast/repo_config.py193-256
The FeatureStore class serves as the main entry point and delegates operations to a Provider implementation. The default PassthroughProvider orchestrates three types of components: the offline store, the online store, and the batch materialization (compute) engine.
The RepoConfig class (sdk/python/feast/repo_config.py193-256) binds these components together via configuration mappings:
- OFFLINE_STORE_CLASS_FOR_TYPE maps store types to implementation classes (sdk/python/feast/repo_config.py90-106)
- ONLINE_STORE_CLASS_FOR_TYPE maps online store types (sdk/python/feast/repo_config.py68-88)
- BATCH_ENGINE_CLASS_FOR_TYPE maps compute engine types (sdk/python/feast/repo_config.py46-53)

Sources: sdk/python/feast/data_source.py sdk/python/feast/feature_store.py44-50 sdk/python/feast/repo_operations.py145-156
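The sketch below shows how this fits together from a user's perspective, under the assumption of a local repository: instantiating FeatureStore reads feature_store.yaml from the repo path, and the provider, offline store, and online store classes are resolved from the RepoConfig type mappings described above.

```python
from feast import FeatureStore

# Loading the store parses feature_store.yaml from the given repo path; the
# provider and store implementations are resolved via the RepoConfig mappings.
store = FeatureStore(repo_path=".")

print(store.config.provider)            # e.g. "local"
print(store.config.offline_store.type)  # e.g. "file"
print(store.config.online_store.type)   # e.g. "sqlite"
```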
DataSource objects define where feature data originates. Each data source specifies:
- Column mappings such as timestamp_field, created_timestamp_column, and field_mapping

Streaming sources (PushSource, KafkaSource, KinesisSource) also reference a batch_source for historical data, enabling unified feature definitions across streaming and batch pipelines (sdk/python/feast/feature_store.py913-923).
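As a concrete illustration, the hedged sketch below defines a batch FileSource and a PushSource that reuses it as its batch_source; the source names and file path are hypothetical.

```python
from feast import FileSource, PushSource

# Batch source backing the historical side of the feature definitions.
driver_stats_batch = FileSource(
    name="driver_stats_batch",
    path="data/driver_stats.parquet",        # hypothetical path
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Push/streaming source that points at the batch source, so the same feature
# view definition serves both the streaming and batch pipelines.
driver_stats_push = PushSource(
    name="driver_stats_push",
    batch_source=driver_stats_batch,
)
```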
Offline stores handle historical feature retrieval through point-in-time correct SQL joins. They read data from data warehouses, file systems, or compute engines.
Sources: sdk/python/feast/infra/offline_stores/offline_store.py73-295 sdk/python/feast/infra/provider.py248-281
The OfflineStore interface defines how to retrieve historical features:
Key Methods:
- get_historical_features(): Performs point-in-time joins between the entity DataFrame and feature views (sdk/python/feast/infra/offline_stores/bigquery.py234-341)
- pull_latest_from_table_or_query(): Retrieves the latest feature values for materialization (sdk/python/feast/infra/offline_stores/bigquery.py126-184)
- pull_all_from_table_or_query(): Retrieves all feature values in a time range (sdk/python/feast/infra/offline_stores/bigquery.py185-232)
- offline_write_batch(): Writes features back to the offline store (sdk/python/feast/infra/offline_stores/bigquery.py399-451)

All retrieval methods return a RetrievalJob object that lazily executes the query. The job can be converted to pandas, Arrow, or SQL, and supports on-demand feature transformations (sdk/python/feast/infra/offline_stores/offline_store.py140-184).
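A typical call pattern looks like the hedged sketch below, where the feature view name and entity columns are assumptions; the returned RetrievalJob does no work until one of the conversion methods is invoked.

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with per-row event timestamps drive the point-in-time join.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 11:00"]),
    }
)

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # hypothetical feature view:feature
)

training_df = job.to_df()     # executes the query and returns a pandas DataFrame
arrow_table = job.to_arrow()  # alternatively, a pyarrow Table
```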
| Store Type | Class | Data Sources | Key Features |
|---|---|---|---|
| BigQuery | BigQueryOfflineStore | BigQuerySource | Native SQL, GCS staging, point-in-time joins |
| Snowflake | SnowflakeOfflineStore | SnowflakeSource | Native SQL, Snowflake compute |
| Redshift | RedshiftOfflineStore | RedshiftSource | SQL via Data API, S3 staging |
| Spark | SparkOfflineStore | SparkSource | Distributed compute, table/query sources |
| PostgreSQL | PostgreSQLOfflineStore | PostgreSQLSource | Temp tables or embedded queries |
| Trino | TrinoOfflineStore | TrinoSource | Federated queries, multiple backends |
| Dask | DaskOfflineStore | FileSource | Parquet files, local/S3/GCS |
Sources: sdk/python/feast/repo_config.py90-106 sdk/python/feast/infra/offline_stores/bigquery.py125-456 sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py79-402 sdk/python/feast/infra/offline_stores/contrib/postgres_offline_store/postgres.py66-387
Each implementation follows the same pattern:
- Accept RepoConfig and DataSource objects as inputs
- Return RetrievalJob wrappers around query results

For example, BigQuery uses ROW_NUMBER() OVER(PARTITION BY ...) for deduplication (sdk/python/feast/infra/offline_stores/bigquery.py164-176), while Spark uses window functions with the same semantics (sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py129-140).
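To make the deduplication semantics concrete, here is a rough pandas analogue of that window-function pattern; the column names are illustrative, not Feast's internals.

```python
import pandas as pd

def latest_per_entity(df: pd.DataFrame, start, end) -> pd.DataFrame:
    """Keep only the newest row per entity key within [start, end], mirroring
    ROW_NUMBER() OVER (PARTITION BY entity ORDER BY event_timestamp DESC) = 1."""
    in_window = df[(df["event_timestamp"] >= start) & (df["event_timestamp"] <= end)]
    ordered = in_window.sort_values(
        ["driver_id", "event_timestamp", "created"],
        ascending=[True, False, False],
    )
    return ordered.drop_duplicates(subset=["driver_id"], keep="first")
```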
Online stores provide low-latency feature retrieval for real-time inference. They store pre-computed feature values keyed by entity identifiers.
Sources: sdk/python/feast/infra/online_stores/online_store.py33-407 sdk/python/feast/infra/provider.py123-173
The OnlineStore interface defines two categories of operations:
Infrastructure Management:
- plan(): Returns infrastructure objects needed for feature views (sdk/python/feast/infra/online_stores/sqlite.py204-240)
- update(): Creates or modifies tables for feature views (sdk/python/feast/infra/online_stores/sqlite.py242-308)
- teardown(): Removes tables when feature views are deleted (sdk/python/feast/infra/online_stores/sqlite.py310-330)

Data Operations:
- online_write_batch(): Writes feature values with entity keys and timestamps (sdk/python/feast/infra/online_stores/sqlite.py150-202)
- online_read(): Reads feature values by entity keys (sdk/python/feast/infra/passthrough_provider.py225-237)
- get_online_features(): High-level API that resolves feature services and entity rows (sdk/python/feast/infra/online_stores/online_store.py156-240)

| Store Type | Class | Key Features |
|---|---|---|
| SQLite | SqliteOnlineStore | Local development, vector search support (v3.10) |
| Redis | RedisOnlineStore | Production-ready, clustering, TTL support |
| DynamoDB | DynamoDBOnlineStore | AWS-native, on-demand capacity, TTL support |
| PostgreSQL | PostgreSQLOnlineStore | ACID guarantees, vector support via pgvector |
| Cassandra | CassandraOnlineStore | High availability, distributed writes |
| Bigtable | BigtableOnlineStore | GCP-native, low latency at scale |
| Milvus | MilvusOnlineStore | Vector-native, similarity search optimized |
Sources: sdk/python/feast/repo_config.py68-88 sdk/python/feast/infra/online_stores/sqlite.py117-332 sdk/python/feast/infra/online_stores/milvus_online_store/milvus.py86-450
Each online store implements entity key serialization differently. For example:
- SQLite serializes entity keys with serialize_entity_key() (sdk/python/feast/infra/online_stores/sqlite.py166-170)

Online stores also handle feature value encoding using Protocol Buffer ValueProto messages for type safety (sdk/python/feast/infra/online_stores/sqlite.py415-446).
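A minimal sketch of serializing an entity key with serialize_entity_key(), assuming the proto import paths shown and serialization version 2 (in practice the version comes from entity_key_serialization_version in feature_store.yaml):

```python
from feast.infra.key_encoding_utils import serialize_entity_key
from feast.protos.feast.types.EntityKey_pb2 import EntityKey as EntityKeyProto
from feast.protos.feast.types.Value_pb2 import Value as ValueProto

# One entity key: the join key name plus its typed value.
entity_key = EntityKeyProto(
    join_keys=["driver_id"],
    entity_values=[ValueProto(int64_val=1001)],
)

# Bytes used by the online store as the row/primary key.
key_bytes = serialize_entity_key(entity_key, entity_key_serialization_version=2)
```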
Materialization is the process of copying feature values from the offline store to the online store, making them available for low-latency serving.
Sources: sdk/python/feast/feature_store.py1349-1434 sdk/python/feast/infra/provider.py222-246 sdk/python/feast/infra/passthrough_provider.py404-615
The materialization process involves several steps:
1. Feature View Selection: _get_feature_views_to_materialize() identifies which feature views need materialization based on the online=True flag (sdk/python/feast/feature_store.py658-723)
2. Offline Retrieval: pull_latest_from_table_or_query() creates a SQL query that deduplicates features using window functions, keeping only the latest value per entity within the time range (sdk/python/feast/infra/offline_stores/bigquery.py126-184)
3. Batch Processing: The ComputeEngine reads data in batches to avoid memory issues. The default batch size is 10,000 rows (sdk/python/feast/infra/passthrough_provider.py55)
4. Entity Key Serialization: Entity values are serialized into EntityKeyProto messages using serialize_entity_key() (sdk/python/feast/infra/key_encoding_utils.py)
5. Online Write: online_write_batch() writes tuples of (EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]) to the online store (sdk/python/feast/infra/provider.py123-147)
6. Metadata Update: The registry tracks the last successful materialization timestamp per feature view (sdk/python/feast/feature_store.py1420-1422)
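From the SDK, the whole flow is triggered with materialize() or materialize_incremental(); the sketch below assumes a local repository and a seven-day backfill window.

```python
from datetime import datetime, timedelta
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Backfill a fixed window from the offline store into the online store.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
)

# Or pick up from the last recorded materialization timestamp per feature view.
store.materialize_incremental(end_date=datetime.utcnow())
```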
Sources: sdk/python/feast/feature_store.py1672-1876 sdk/python/feast/infra/offline_stores/bigquery.py234-341 sdk/python/feast/infra/offline_stores/offline_utils.py74-399
Point-in-time joins ensure training data correctness by joining entity rows with feature values that existed at the exact timestamp of each row:
1. Entity DataFrame Upload: The entity DataFrame is uploaded to a temporary table in the offline store (sdk/python/feast/infra/offline_stores/bigquery.py286-290)
2. Timestamp Inference: If not explicitly provided, Feast infers the event timestamp column by looking for datetime columns (sdk/python/feast/infra/offline_stores/offline_utils.py28-44)
3. Query Context Building: get_feature_view_query_context() creates a context object for each feature view containing join keys, feature names, and timestamp bounds (sdk/python/feast/infra/offline_stores/offline_utils.py193-399)
4. SQL Generation: build_point_in_time_query() generates SQL with LEFT JOIN clauses that include timestamp predicates ensuring no future data leakage (sdk/python/feast/infra/offline_stores/offline_utils.py462-1081)
5. Result Materialization: The RetrievalJob executes the query lazily when to_df() or to_arrow() is called (sdk/python/feast/infra/offline_stores/offline_store.py76-95)
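The join semantics can be pictured with a pandas merge_asof; this is only a conceptual analogue of the generated SQL, not Feast's implementation, and the column names are illustrative.

```python
import pandas as pd

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001],
        "event_timestamp": pd.to_datetime(["2024-05-01", "2024-05-03"]),
    }
)
feature_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001],
        "event_timestamp": pd.to_datetime(["2024-04-30", "2024-05-02"]),
        "conv_rate": [0.1, 0.2],
    }
)

# Each entity row picks up the most recent feature value at or before its own
# event_timestamp, so future values can never leak into the training set.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
```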
Sources: sdk/python/feast/feature_store.py1878-2047 sdk/python/feast/infra/online_stores/online_store.py156-240 sdk/python/feast/on_demand_feature_view.py
Online feature retrieval is optimized for low latency:
1. Entity Key Serialization: Entity dictionaries are converted to EntityKeyProto using the configured serialization version (sdk/python/feast/feature_store.py238-245)
2. Batch Read: online_read() retrieves features for multiple entities in a single operation (sdk/python/feast/infra/online_stores/sqlite.py332-413)
3. Feature Assembly: Results are assembled into an OnlineResponse object containing field values and metadata (sdk/python/feast/online_response.py)
4. On-Demand Transformations: If requested features include OnDemandFeatureView definitions, transformations are applied in-process before returning results (sdk/python/feast/on_demand_feature_view.py)
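A typical online lookup, with hypothetical feature and entity names, looks like this:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

response = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",  # hypothetical feature view:feature
        "driver_hourly_stats:acc_rate",
    ],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
)

# OnlineResponse exposes the assembled feature vectors as a plain dict.
feature_vectors = response.to_dict()
```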
Data integration is configured through the feature_store.yaml file, which specifies offline store, online store, and provider settings.
Sources: sdk/python/feast/repo_config.py193-557
Key Configuration Fields:
| Field | Purpose | Options |
|---|---|---|
| provider | Infrastructure orchestration | local, gcp, aws, azure (all use PassthroughProvider) |
| offline_store.type | Offline store implementation | file, bigquery, snowflake.offline, redshift, spark, postgres, trino, dask |
| online_store.type | Online store implementation | sqlite, redis, dynamodb, postgres, cassandra, bigtable, milvus |
| batch_engine.type | Materialization compute | local, snowflake.engine, lambda, k8s, spark.engine |
| registry.type | Metadata persistence | file, sql, snowflake.registry, remote |
The configuration system uses dynamic class loading via get_offline_config_from_type() and get_online_config_from_type() functions (sdk/python/feast/repo_config.py630-636 sdk/python/feast/repo_config.py600-608). This allows third-party implementations by specifying fully-qualified class names.
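The underlying idea can be sketched as follows; the helper name is hypothetical and the snippet only illustrates the mechanism, not Feast's actual loader.

```python
import importlib

def load_store_class(fully_qualified_name: str):
    """Resolve a 'package.module.ClassName' string from feature_store.yaml
    into the class object it names (hypothetical helper, for illustration)."""
    module_name, class_name = fully_qualified_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# e.g. load_store_class("my_company.feast_ext.CustomOfflineStore") for a
# third-party store; a stdlib class is used here so the sketch runs as-is.
JsonDecoder = load_store_class("json.JSONDecoder")
```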
Data sources are defined in Python code rather than YAML:
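A minimal example of such a definition, with hypothetical table and column names, might look like:

```python
from feast import BigQuerySource

driver_stats_source = BigQuerySource(
    name="driver_hourly_stats",
    table="my_project.feast.driver_hourly_stats",    # hypothetical table
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
    field_mapping={"conversion_rate": "conv_rate"},  # rename source column -> feature name
)
```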
Sources: tests/example_repos/example_feature_repo_1.py14-18 sdk/python/feast/data_source.py
Data sources support several configuration options:
- timestamp_field (event time) and created_timestamp_column (write time)
- Column renaming via a field_mapping dict

The provider is automatically determined by the provider field in feature_store.yaml. All standard providers (local, gcp, aws, azure) use PassthroughProvider (sdk/python/feast/infra/provider.py41-46), which delegates to the configured stores.
Sources: sdk/python/feast/infra/provider.py533-546
Custom providers can be implemented by extending the Provider abstract class and specifying the fully-qualified class name as the provider value.
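A minimal custom provider might simply subclass PassthroughProvider to reuse its orchestration and override only selected hooks, then be referenced from feature_store.yaml by its fully-qualified name; the class and module names below are hypothetical.

```python
from feast.infra.passthrough_provider import PassthroughProvider

class MyProvider(PassthroughProvider):
    """Hypothetical custom provider; override only the methods whose behavior
    needs to change. Referenced in feature_store.yaml as, for example:
    provider: my_company.feast_ext.MyProvider"""
    pass
```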