ESQL: Support ST_EXTENT_AGG (#117451) #118829

Merged
merged 1 commit into from
Dec 17, 2024

GalLalouche
Contributor

Backporting #117451.
This PR adds support for ST_EXTENT_AGG aggregation, i.e., computing a bounding box over a set of points/shapes (Cartesian or geo). Note the difference between this aggregation and the already implemented scalar function ST_EXTENT.

This isn't a very efficient implementation, and future PRs will attempt to read these extents directly from the doc values. We currently always use longitude wrapping, i.e., we may wrap around the dateline for a smaller bounding box. Future PRs will let the user control this behavior. Fixes #104659.
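
To illustrate what "longitude wrapping" means here, the following is a minimal Python sketch, not the code from this PR. It assumes plain `(lon, lat)` points in degrees, and the function name `geo_bounding_box` is hypothetical. It tracks longitude extremes separately for each hemisphere and, at the end, returns whichever box is narrower: the ordinary one or the one that crosses the dateline.

```python
from typing import Iterable, Tuple


def geo_bounding_box(points: Iterable[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (left, right, top, bottom) for (lon, lat) points in degrees.

    Hypothetical helper for illustration only. When the returned box
    crosses the dateline, left > right.
    """
    inf = float("inf")
    top, bottom = -inf, inf
    neg_left, neg_right = inf, -inf  # extremes of longitudes < 0
    pos_left, pos_right = inf, -inf  # extremes of longitudes >= 0
    for lon, lat in points:
        top, bottom = max(top, lat), min(bottom, lat)
        if lon < 0:
            neg_left, neg_right = min(neg_left, lon), max(neg_right, lon)
        else:
            pos_left, pos_right = min(pos_left, lon), max(pos_right, lon)
    if top == -inf:
        raise ValueError("no points")
    if neg_left == inf:  # only non-negative longitudes were seen
        return pos_left, pos_right, top, bottom
    if pos_left == inf:  # only negative longitudes were seen
        return neg_left, neg_right, top, bottom
    # Both hemispheres were seen: compare the two candidate widths.
    unwrapped_width = pos_right - neg_left
    wrapped_width = (180.0 - pos_left) + (neg_right + 180.0)
    if unwrapped_width <= wrapped_width:
        return neg_left, pos_right, top, bottom
    return pos_left, neg_right, top, bottom  # crosses the dateline
```

For example, the points `(170, 10)` and `(-170, -10)` give a box with `left = 170` and `right = -170`, which is 20 degrees wide instead of 340.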

@GalLalouche GalLalouche added >feature :Analytics/Geo Indexing, search aggregations of geo points and shapes backport Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Analytics/ES|QL AKA ESQL v8.18.0 labels Dec 17, 2024
@GalLalouche GalLalouche force-pushed the cherry_pick/st_extent branch 2 times, most recently from 5bd6b61 to 9efa791 Compare December 17, 2024 10:44
@GalLalouche GalLalouche force-pushed the cherry_pick/st_extent branch from 9efa791 to 2029c26 Compare December 17, 2024 11:01
@elasticsearchmachine elasticsearchmachine merged commit 905f9f4 into elastic:8.x Dec 17, 2024
15 checks passed
@GalLalouche GalLalouche deleted the cherry_pick/st_extent branch December 17, 2024 12:14
craigtaverner added a commit that referenced this pull request Jan 16, 2025
Support for `ST_EXTENT_AGG` was added in #118829, and then partially optimized in #118829. That optimization worked only for cartesian_shape fields: it extracted the Extent from the doc-values and re-encoded it as a WKB `BBOX` geometry. This does not work for geo_shape, where all 6 integers stored in the doc-values must be retained so that the dateline choice is made only at reduce time, during the final phase of the aggregation.
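
A small Python sketch of why the dateline choice has to wait until reduce time. This is an illustration under stated assumptions, not the Elasticsearch code: the `GeoExtentState` tuple, its field order, the sentinel values, and the use of whole degrees instead of encoded integers are all hypothetical.

```python
from typing import NamedTuple, Tuple

INT_MAX, INT_MIN = 2**31 - 1, -2**31  # hypothetical "nothing seen yet" sentinels


class GeoExtentState(NamedTuple):
    """Hypothetical six-value intermediate state, in whole degrees for readability."""
    top: int
    bottom: int
    neg_left: int
    neg_right: int
    pos_left: int
    pos_right: int


EMPTY = GeoExtentState(INT_MIN, INT_MAX, INT_MAX, INT_MIN, INT_MAX, INT_MIN)


def add_point(s: GeoExtentState, lon: int, lat: int) -> GeoExtentState:
    """Fold one point into the state, keeping the hemispheres separate."""
    top, bottom = max(s.top, lat), min(s.bottom, lat)
    if lon < 0:
        return s._replace(top=top, bottom=bottom,
                          neg_left=min(s.neg_left, lon), neg_right=max(s.neg_right, lon))
    return s._replace(top=top, bottom=bottom,
                      pos_left=min(s.pos_left, lon), pos_right=max(s.pos_right, lon))


def merge(a: GeoExtentState, b: GeoExtentState) -> GeoExtentState:
    """Merging is a field-wise min/max; no dateline decision is taken here."""
    return GeoExtentState(
        max(a.top, b.top), min(a.bottom, b.bottom),
        min(a.neg_left, b.neg_left), max(a.neg_right, b.neg_right),
        min(a.pos_left, b.pos_left), max(a.pos_right, b.pos_right),
    )


def widths(s: GeoExtentState) -> Tuple[int, int]:
    """(unwrapped, wrapped) longitude widths; assumes both hemispheres were seen."""
    return s.pos_right - s.neg_left, (180 - s.pos_left) + (s.neg_right + 180)


shard_a = add_point(add_point(EMPTY, -100, 0), 100, 0)
shard_b = add_point(add_point(EMPTY, -1, 0), 1, 0)
merged = merge(shard_a, shard_b)
```

Here `widths(shard_a)` is `(200, 160)`, so shard A on its own would prefer the box that crosses the dateline. `widths(merged)` is `(200, 358)`, so once shard B's points near longitude 0 are included the ordinary box is the narrower one. Collapsing shard A to four values before merging would have discarded the information needed to see that.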

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moving the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)
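
The 6-int to 4-int conversion for cartesian shapes can be sketched as follows. This is a hedged illustration in Python: the tuple layouts `(top, bottom, neg_left, neg_right, pos_left, pos_right)` and `(min_x, max_x, max_y, min_y)` are assumptions about the Extent and Rectangle orderings, and the sentinel values and function name are hypothetical. Cartesian coordinates have no dateline, so the per-sign split carries no extra information and can be collapsed as soon as the values are loaded.

```python
from typing import Tuple

# Assumed layouts, for illustration only.
Extent6 = Tuple[int, int, int, int, int, int]  # (top, bottom, neg_left, neg_right, pos_left, pos_right)
Rect4 = Tuple[int, int, int, int]              # (min_x, max_x, max_y, min_y)

INT_MAX, INT_MIN = 2**31 - 1, -2**31  # hypothetical sentinels for a side with no values


def cartesian_rect_from_extent(extent: Extent6) -> Rect4:
    """Collapse a six-value extent into a four-value rectangle.

    A side that saw no values holds sentinels, which min/max ignore naturally.
    """
    top, bottom, neg_left, neg_right, pos_left, pos_right = extent
    return (min(neg_left, pos_left), max(neg_right, pos_right), top, bottom)


# A shape entirely at positive x, and one entirely at negative x.
east_only = cartesian_rect_from_extent((50, -20, INT_MAX, INT_MIN, 10, 300))
west_only = cartesian_rect_from_extent((5, -5, -40, -3, INT_MAX, INT_MIN))
```

With these inputs `east_only` is `(10, 300, 50, -20)` and `west_only` is `(-40, -3, 5, -5)`.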

Since the PR had already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already incompatible with previous cluster versions. We disabled mixed-cluster testing to prevent the resulting errors. This gave us the opportunity to make further mixed-cluster-incompatible changes, hence the decision to perform this consistency update now.
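
One way to picture the role of the invalid/infinite values and the missing-groups fix is the Python sketch below. It is an assumption-laden illustration, not the actual encoding: the sentinel choice, the `(min_x, max_x, max_y, min_y)` order, and the helper names are all hypothetical. The idea is that a group with no data still emits a state, and that state must be an identity element so merging it changes nothing.

```python
from typing import Dict, List, Tuple

Rect4 = Tuple[int, int, int, int]  # assumed order: (min_x, max_x, max_y, min_y)

INT_MAX, INT_MIN = 2**31 - 1, -2**31
# Hypothetical identity element: minima start at +"infinity", maxima at -"infinity".
EMPTY: Rect4 = (INT_MAX, INT_MIN, INT_MIN, INT_MAX)


def merge(a: Rect4, b: Rect4) -> Rect4:
    """Field-wise min/max; merge(EMPTY, x) == x for any x."""
    return (min(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))


def is_empty(r: Rect4) -> bool:
    """A state that never saw a value still has min_x > max_x."""
    return r[0] > r[1]


def intermediate_state(groups: Dict[int, Rect4], group_count: int) -> List[Rect4]:
    """Emit one row per group id, filling ids that have no data with EMPTY."""
    return [groups.get(g, EMPTY) for g in range(group_count)]


node_a = intermediate_state({0: (1, 2, 4, 3)}, 3)
node_b = intermediate_state({0: (0, 5, 9, -1), 2: (7, 8, 8, 7)}, 3)
reduced = [merge(x, y) for x, y in zip(node_a, node_b)]
```

In this example group 1 has no data on either node, so `is_empty(reduced[1])` is true and the final phase can report it as null rather than as a bogus rectangle.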
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Jan 16, 2025
)

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Feb 11, 2025
)

elasticsearchmachine pushed a commit that referenced this pull request Feb 12, 2025
…) (#122276)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889)

* Regenerate generated files
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Feb 12, 2025
…ic#119889) (elastic#122276)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (elastic#119889)

* Regenerate generated files
elasticsearchmachine pushed a commit that referenced this pull request Feb 13, 2025
…) (#122276) (#122420)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889)

* Regenerate generated files
Labels
:Analytics/ES|QL AKA ESQL :Analytics/Geo Indexing, search aggregations of geo points and shapes auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0
2 participants