feat: Offline Store historical features retrieval based on datetime range in Ray #5738
Conversation
jyejare left a comment:
Looking good initially; I have some doubts. Tests also need to be added.
    return pa.Table.from_pandas(df).schema
    ...
    def _compute_non_entity_dates_ray(
I think we should make a common utility function for this, so that it can be used in all stores without repeating the code.
wdyt?
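For reference, a minimal sketch of what such a shared helper could look like, assuming a pandas source. The function name, signature, and placement are hypothetical, not the actual Feast implementation:

```python
from datetime import datetime
from typing import List, Optional

import pandas as pd


def compute_entity_df_from_source(
    source_df: pd.DataFrame,
    join_keys: List[str],
    timestamp_field: str,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> pd.DataFrame:
    """Derive an entity dataframe from a source when entity_df=None.

    Filters the source to the [start_date, end_date] window and returns
    the distinct (join_keys, event_timestamp) combinations, so every
    offline store can reuse the same logic instead of duplicating it.
    """
    mask = pd.Series(True, index=source_df.index)
    if start_date is not None:
        mask &= source_df[timestamp_field] >= start_date
    if end_date is not None:
        mask &= source_df[timestamp_field] <= end_date
    window = source_df.loc[mask, join_keys + [timestamp_field]]
    return window.drop_duplicates().rename(
        columns={timestamp_field: "event_timestamp"}
    )
```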
    return _filter_range
    ...
    def _make_select_distinct_keys(join_keys: List[str]):
I think we should not drop rows with duplicate IDs, because there can be multiple transactions per ID, and we need to choose the row based on timestamp while joining the columns from another table/view. I think this is the same case as in your Spark PR.
Please check the Postgres implementation to understand the case.
Or am I misreading this?
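To illustrate the concern, here is a small pandas sketch (not Feast code, with made-up data) showing why the point-in-time join needs every transaction row rather than one row per distinct ID:

```python
import pandas as pd

entity_df = pd.DataFrame({
    "driver_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2026-01-10", "2026-01-15"]),
})
feature_df = pd.DataFrame({
    "driver_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2026-01-09", "2026-01-14"]),
    "trips_today": [3, 7],
})

# merge_asof picks, per entity event, the latest feature row at or before
# that event's timestamp; deduplicating on driver_id first would collapse
# both transactions into a single row and lose one of the joins.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
print(joined)  # two rows: trips_today=3, then trips_today=7
```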
Testing the case after discussion
Previously, when entity_df=None was passed to get_historical_features(), the Ray offline store would extract only distinct entity keys and assign a single fixed timestamp (end_date) to all entities. This broke point-in-time joins in cases where multiple transactions exist per entity ID within the date-time range.
It now extracts distinct (entity_keys, event_timestamp) combinations, aligning with the Postgres-based offline store's behaviour.
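A small pandas sketch of the behavioural change described above (column names are illustrative):

```python
import pandas as pd

source = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2026-01-10", "2026-01-15", "2026-01-12"]
    ),
})
end_date = pd.Timestamp("2026-01-31")

# Old behaviour: one row per distinct key, all pinned to end_date, so a
# point-in-time join could only ever surface one transaction per entity.
old = source[["driver_id"]].drop_duplicates().assign(event_timestamp=end_date)

# New behaviour: distinct (entity_keys, event_timestamp) pairs are kept, so
# every transaction in the window gets its own point-in-time join row.
new = source[["driver_id", "event_timestamp"]].drop_duplicates()

print(old)  # 2 rows (one per driver), both stamped 2026-01-31
print(new)  # 3 rows, original timestamps preserved
```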
…ange in Ray Signed-off-by: Aniket Paluskar <[email protected]>
Signed-off-by: Aniket Paluskar <[email protected]>
… joins Signed-off-by: Aniket Paluskar <[email protected]>
Signed-off-by: Aniket Paluskar <[email protected]>
Signed-off-by: Aniket Paluskar <[email protected]>
Force-pushed from 408f40c to f589956.
# [0.59.0](v0.58.0...v0.59.0) (2026-01-16)

### Bug Fixes

* Add get_table_query_string_with_alias() for PostgreSQL subquery aliasing ([#5811](#5811)) ([11122ce](11122ce))
* Add hybrid online store to ONLINE_STORE_CLASS_FOR_TYPE mapping ([#5810](#5810)) ([678589b](678589b))
* Add possibility to overwrite send_receive_timeout for clickhouse offline store ([#5792](#5792)) ([59dbb33](59dbb33))
* Denial by default to all resources when no permissions set ([#5663](#5663)) ([1524f1c](1524f1c))
* Make operator include full OIDC secret in repo config ([#5676](#5676)) ([#5809](#5809)) ([a536bc2](a536bc2))
* Populate Postgres `registry.path` during `feast init` ([#5785](#5785)) ([f293ae8](f293ae8))
* **redis:** Preserve millisecond timestamp precision for Redis online store ([#5807](#5807)) ([9e3f213](9e3f213))
* Search API to return all matching tags in matched_tags field ([#5843](#5843)) ([de37f66](de37f66))
* Spark Materialization Engine Cannot Infer Schema ([#5806](#5806)) ([58d0325](58d0325)), closes [#5594](#5594)
* Support arro3 table schema with newer deltalake packages ([#5799](#5799)) ([103c5e9](103c5e9))
* Timestamp formatting and lakehouse-type connector for trino_offline_store ([#5846](#5846)) ([c2ea7e9](c2ea7e9))
* Update model_validator to use instance method signature (Pydantic v2.12 deprecation) ([#5825](#5825)) ([3c10b6e](3c10b6e))

### Features

* Add dbt integration for importing models as FeatureViews ([#5827](#5827)) ([b997361](b997361)), closes [#3335](#3335)
* Add GCS registry store in Go feature server ([#5818](#5818)) ([1dc2be5](1dc2be5))
* Add progress bar to CLI for feast apply ([#5867](#5867)) ([ab3562b](ab3562b))
* Add RBAC blog post to website ([#5861](#5861)) ([b1844a3](b1844a3))
* Add skip_feature_view_validation parameter to FeatureStore.apply() and plan() ([#5859](#5859)) ([5482a0e](5482a0e))
* Added batching to feature server /push to offline store ([#5683](#5683)) ([#5729](#5729)) ([ce35ce6](ce35ce6))
* Enable static artifacts for feature server that can be used in Feature Transformations ([#5787](#5787)) ([edefc3f](edefc3f))
* Improve lambda materialization engine ([#5829](#5829)) ([f6116f9](f6116f9))
* Offline Store historical features retrieval based on datetime range in Ray ([#5738](#5738)) ([e484c12](e484c12))
* Read, Save docs and chat fixes ([#5865](#5865)) ([2081b55](2081b55))
* Resolve pyarrow >21 installation with ibis-framework ([#5847](#5847)) ([8b9bb50](8b9bb50))
* Support staging for spark materialization ([#5671](#5671)) ([#5797](#5797)) ([5b787af](5b787af))
What this PR does / why we need it:
Adds support for entity_df=None in RayOfflineStore.get_historical_features with start_date/end_date.
- Derives the entity set by reading distinct join keys from each FeatureView source within the time window, applies field mappings and join_key_map, filters by timestamp, and unions the aligned schemas.
- Adds a stable event_timestamp = end_date for PIT joins.
Signature change: get_historical_features now accepts entity_df: Optional[Union[pd.DataFrame, str]] and **kwargs.
- Why: to match the base interface and support date-only retrieval. A usage sketch follows.
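A hypothetical usage sketch of the date-only retrieval path. It assumes FeatureStore.get_historical_features forwards the start_date/end_date keyword arguments through to RayOfflineStore via the new **kwargs, which this excerpt does not confirm; the feature reference is an example:

```python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# With entity_df=None, the Ray offline store derives the entity set from
# the FeatureView sources inside the [start_date, end_date] window.
job = store.get_historical_features(
    entity_df=None,
    features=["driver_hourly_stats:conv_rate"],  # example feature reference
    start_date=datetime(2026, 1, 1),
    end_date=datetime(2026, 1, 31),
)
df = job.to_df()
```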
Which issue(s) this PR fixes:
RHOAIENG-38643
Misc