Description
Summary
When using store.push() with PushMode.OFFLINE or PushMode.ONLINE_AND_OFFLINE, array/list-typed columns (e.g., STRING_LIST) are written to BigQuery as empty arrays [], even though the data is correct in the DataFrame and the PyArrow table.
Root Cause
The BigQuery LoadJobConfig built in offline_write_batch() does not set parquet_options.enable_list_inference = True. Without this option, BigQuery's Parquet loader does not apply schema inference to the LIST logical type that PyArrow writes, and list columns load as empty arrays.
Related issue: googleapis/python-bigquery#2370 (comment)
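The behavior is reproducible with the BigQuery client alone, outside Feast. A minimal sketch, assuming an existing destination table my_project.my_dataset.repro with schema (entity_id STRING, tags ARRAY&lt;STRING&gt;); omitting the parquet_options argument reproduces the empty-array result:

import io
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

# A one-row table with a list<string> column, mirroring a STRING_LIST feature
table = pa.table({
    "entity_id": ["test_123"],
    "tags": [["category_a", "category_b"]],
})

buf = io.BytesIO()
pq.write_table(table, buf)
buf.seek(0)

client = bigquery.Client()
parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # omit this to reproduce the bug

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition="WRITE_APPEND",
    parquet_options=parquet_options,
)
client.load_table_from_file(
    buf, "my_project.my_dataset.repro", job_config=job_config
).result()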
Steps to Reproduce
from feast import FeatureStore
from feast.data_source import PushMode
import pandas as pd
from datetime import datetime, timezone
# Assumes a feature view with a STRING_LIST field is registered
data = {
    "entity_id": "test_123",
    "tags": ["category_a", "category_b"],  # STRING_LIST type
    "event_time": datetime.now(timezone.utc),
}
df = pd.DataFrame([data])
store = FeatureStore(repo_path=".")
store.push("my_push_source", df, to=PushMode.ONLINE_AND_OFFLINE)
# Result in BigQuery: tags = [] (empty array)
# Expected: tags = ["category_a", "category_b"]
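A quick sanity check (not part of the Feast API; the /tmp path is illustrative) confirms the data is intact up to the load step, which isolates the loss to the BigQuery load itself:

import pyarrow as pa
import pyarrow.parquet as pq

# The list column survives the DataFrame -> Arrow conversion ...
pa_table = pa.Table.from_pandas(df)
print(pa_table.column("tags"))  # [["category_a", "category_b"]]

# ... and a local Parquet round trip, so the values reach BigQuery intact.
pq.write_table(pa_table, "/tmp/repro.parquet")
print(pq.read_table("/tmp/repro.parquet").column("tags"))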
Expected Behavior
Array data should be correctly written to BigQuery with values preserved.
Actual Behavior
Array columns are written as empty arrays [] in BigQuery, while the online store receives correct data.
Proposed Fix
In feast/infra/offline_stores/bigquery.py, update offline_write_batch() (~line 428):
@staticmethod
def offline_write_batch(
    config: RepoConfig,
    feature_view: FeatureView,
    table: pyarrow.Table,
    progress: Optional[Callable[[int], Any]],
):
    # ... existing code ...
    parquet_options = bigquery.ParquetOptions()
    parquet_options.enable_list_inference = True

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        schema=arrow_schema_to_bq_schema(pa_schema),
        create_disposition=config.offline_store.table_create_disposition,
        write_disposition="WRITE_APPEND",
        parquet_options=parquet_options,  # Add this line
    )
    # ... rest of code ...
Environment
- Feast version: 0.58.0
- Python version: 3.12
- BigQuery client version: (latest)
Additional Context
- The online store (PostgreSQL) receives the array data correctly
- The PyArrow table contains the correct array data before the Parquet write
- The Parquet file contains the correct data when read back locally
- Only the BigQuery load loses the array contents
- Using load_table_from_json instead of the Parquet load path works correctly (see the workaround sketch below)
- Adding enable_list_inference=True to ParquetOptions fixes the issue
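For anyone hitting this before a fix lands, a minimal workaround sketch that bypasses the Parquet path entirely. This is not Feast API; the destination table id is hypothetical, and the timestamp conversion is needed because load_table_from_json does not serialize raw datetime objects:

from google.cloud import bigquery

client = bigquery.Client()

# Convert timestamps to ISO strings before handing rows to the JSON loader.
rows = df.copy()
rows["event_time"] = rows["event_time"].map(lambda ts: ts.isoformat())

# A JSON load preserves array values, sidestepping the list-inference problem.
client.load_table_from_json(
    rows.to_dict(orient="records"),
    "my_project.my_dataset.my_table",  # hypothetical destination table
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
).result()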