Avoid over collecting in Limit or Lucene Operator #123296
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-analytical-engine (Team:Analytics)
Great stuff. Sharing state in the pipeline execution (limit, min/max for filters) to trigger early execution is going to help significantly in high cardinality scenarios.
Thanks Costin!
You can use sqren/backport to manually backport by running |
Currently, we rely on signal propagation for early termination. For example, FROM index | LIMIT 10 can be executed by multiple Drivers: several Drivers to read document IDs and extract fields, and the final Driver to select at most 10 rows. In this scenario, each Lucene Driver can independently collect up to 10 rows until the final Driver has enough rows and signals them to stop collecting. In most cases, this model works fine, but when extracting fields from indices in the warm/cold tier, it can impact performance. This change introduces a Limiter used between LimitOperator and LuceneSourceOperator to avoid over-collecting. We will also need a follow-up to ensure that we do not over-collect between multiple stages of query execution.
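To make the idea concrete, here is a minimal sketch of a limiter shared by several drivers, assuming a single atomic counter of remaining rows. The class name `SharedLimiter` and the methods `tryAcquire` and `remaining` are hypothetical names chosen for illustration; they are not taken from the actual change, which may be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of a limiter shared by several drivers.
 * This is an illustration only, not the class introduced by the PR.
 */
final class SharedLimiter {
    // Rows that may still be collected across all drivers.
    private final AtomicInteger remaining;

    SharedLimiter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.remaining = new AtomicInteger(limit);
    }

    /**
     * Tries to reserve up to {@code wanted} rows.
     * Returns the number of rows granted, which is 0 once the limit is exhausted.
     */
    int tryAcquire(int wanted) {
        while (true) {
            int current = remaining.get();
            int granted = Math.min(current, wanted);
            if (granted <= 0) {
                return 0;
            }
            // Retry if another driver changed the counter in the meantime.
            if (remaining.compareAndSet(current, current - granted)) {
                return granted;
            }
        }
    }

    /** Rows still available to all drivers. */
    int remaining() {
        return remaining.get();
    }
}
```

With a limit of 100, a driver that reserves 60 rows leaves at most 40 for the next driver, so the combined output never exceeds the limit even before the final Driver signals the others to stop.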
…23784) * Avoid over collecting in Limit or Lucene Operator (#123296) Currently, we rely on signal propagation for early termination. For example, FROM index | LIMIT 10 can be executed by multiple Drivers: several Drivers to read document IDs and extract fields, and the final Driver to select at most 10 rows. In this scenario, each Lucene Driver can independently collect up to 10 rows until the final Driver has enough rows and signals them to stop collecting. In most cases, this model works fine, but when extracting fields from indices in the warm/cold tier, it can impact performance. This change introduces a Limiter used between LimitOperator and LuceneSourceOperator to avoid over-collecting. We will also need a follow-up to ensure that we do not over-collect between multiple stages of query execution. * Fix compilation after #123784 * fix compile * fix compile
…123783) * Avoid over collecting in Limit or Lucene Operator (#123296) Currently, we rely on signal propagation for early termination. For example, FROM index | LIMIT 10 can be executed by multiple Drivers: several Drivers to read document IDs and extract fields, and the final Driver to select at most 10 rows. In this scenario, each Lucene Driver can independently collect up to 10 rows until the final Driver has enough rows and signals them to stop collecting. In most cases, this model works fine, but when extracting fields from indices in the warm/cold tier, it can impact performance. This change introduces a Limiter used between LimitOperator and LuceneSourceOperator to avoid over-collecting. We will also need a follow-up to ensure that we do not over-collect between multiple stages of query execution. * Fix compilation after #123784 * fix compile * fix compile
A follow-up to elastic#123296 to address a potential block leak that may occur when a circuit-breaking exception is triggered while truncating the docs or scores blocks. Relates elastic#123296 (cherry picked from commit 7560e2e)
/**
 * A shared limiter used by multiple drivers to collect hits in parallel without exceeding the output limit.
 * For example, if the query `FROM test-1,test-2 | LIMIT 100` is run with two drivers, and one driver (e.g., querying `test-1`)
 * has collected 60 hits, then the other driver querying `test-2` should collect at most 40 hits.
Nice idea!
I wonder if we should make it explicit that this works as long as `test-1` and `test-2` are on the same node?