
BigQuery offline store loses array data when pushing features with list types #5845

@max36067

Description


Summary

When pushing features with store.push() using PushMode.OFFLINE or PushMode.ONLINE_AND_OFFLINE, array/list-typed columns (e.g., STRING_LIST) are written to BigQuery as empty arrays [], even though the data is correct in the DataFrame and in the PyArrow table.

Root Cause

The BigQuery LoadJobConfig in offline_write_batch() is missing parquet_options.enable_list_inference = True. Without this option, BigQuery's parquet loader doesn't correctly interpret PyArrow's list format.

Related issue: googleapis/python-bigquery#2370 (comment)
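To rule out Feast itself, the behavior can be reproduced against the raw google-cloud-bigquery client alone. This is a minimal sketch, assuming credentials are configured and that my_project.my_dataset.repro_table (a placeholder) sits in an existing dataset; commenting out the enable_list_inference line reproduces the empty-array result:

  import io

  import pyarrow as pa
  import pyarrow.parquet as pq
  from google.cloud import bigquery

  # One-row table with a list<string> column, mirroring the STRING_LIST field
  table = pa.table({
      "entity_id": ["test_123"],
      "tags": [["category_a", "category_b"]],
  })

  # Serialize to an in-memory parquet buffer, then load it into BigQuery
  buf = io.BytesIO()
  pq.write_table(table, buf)
  buf.seek(0)

  client = bigquery.Client()

  parquet_options = bigquery.ParquetOptions()
  parquet_options.enable_list_inference = True  # comment out to reproduce the bug

  job_config = bigquery.LoadJobConfig(
      source_format=bigquery.SourceFormat.PARQUET,
      parquet_options=parquet_options,
  )
  client.load_table_from_file(
      buf, "my_project.my_dataset.repro_table", job_config=job_config
  ).result()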

Steps to Reproduce

  from feast import FeatureStore
  from feast.data_source import PushMode
  import pandas as pd
  from datetime import datetime, timezone

  # Assuming feature view with STRING_LIST field is configured
  data = {
      "entity_id": "test_123",
      "tags": ["category_a", "category_b"],  # STRING_LIST type
      "event_time": datetime.now(timezone.utc),
  }

  df = pd.DataFrame([data])
  store = FeatureStore(repo_path=".")
  store.push("my_push_source", df, to=PushMode.ONLINE_AND_OFFLINE)

  # Result in BigQuery: tags = [] (empty array)
  # Expected: tags = ["category_a", "category_b"]

Expected Behavior

Array data should be correctly written to BigQuery with values preserved.

Actual Behavior

Array columns are written as empty arrays [] in BigQuery, while the online store receives correct data.

Proposed Fix

In feast/infra/offline_stores/bigquery.py, update offline_write_batch() (~line 428):

  @staticmethod
  def offline_write_batch(
      config: RepoConfig,
      feature_view: FeatureView,
      table: pyarrow.Table,
      progress: Optional[Callable[[int], Any]],
  ):
      # ... existing code ...

      parquet_options = bigquery.ParquetOptions()
      parquet_options.enable_list_inference = True

      job_config = bigquery.LoadJobConfig(
          source_format=bigquery.SourceFormat.PARQUET,
          schema=arrow_schema_to_bq_schema(pa_schema),
          create_disposition=config.offline_store.table_create_disposition,
          write_disposition="WRITE_APPEND",
          parquet_options=parquet_options,  # Add this line
      )
      # ... rest of code ...
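After applying the patch, one way to confirm that arrays survive the load is to read the most recent pushed row back with the client; the table ID below is a placeholder for whatever the feature view's BigQuery source points at:

  from google.cloud import bigquery

  client = bigquery.Client()
  query = """
      SELECT entity_id, tags
      FROM `my_project.my_dataset.my_feature_table`
      ORDER BY event_time DESC
      LIMIT 1
  """
  for row in client.query(query).result():
      print(row.entity_id, list(row.tags))  # expect ['category_a', 'category_b']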

Environment

  • Feast version: 0.58.0
  • Python version: 3.12
  • BigQuery client version: (latest)

Additional Context

  • Online store (PostgreSQL) receives array data correctly
  • The PyArrow table contains correct array data before parquet write
  • Parquet file contains correct data when read locally
  • Only BigQuery load loses the array content
  • Using load_table_from_json instead of parquet works correctly (see the workaround sketch after this list)
  • Adding enable_list_inference=True to ParquetOptions fixes the issue
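For anyone blocked until a release carries the fix, a sketch of that JSON-based workaround, assuming the destination table already exists (table ID and row values are placeholders):

  from google.cloud import bigquery

  client = bigquery.Client()

  rows = [
      {
          "entity_id": "test_123",
          "tags": ["category_a", "category_b"],
          "event_time": "2024-01-01T00:00:00Z",
      }
  ]

  # The JSON load path does not depend on parquet list inference, so the
  # array values come through intact
  job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
  client.load_table_from_json(
      rows, "my_project.my_dataset.my_feature_table", job_config=job_config
  ).result()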
