-
Notifications
You must be signed in to change notification settings - Fork 25.4k
LOOKUP JOIN using field-caps for field mapping #117246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LOOKUP JOIN using field-caps for field mapping #117246
Conversation
Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @craigtaverner, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Wants a review from @astefan because he's refactoring this code and can probably answer questions like "TODO: why is this empty". If we can't get through all of the outstanding TODOs but @astefan is still happy let's get this merged because it's way better than my hack. We should just make sure to lodge the TODOs into the meta issue.
) { | ||
PreAnalyzer.PreAnalysis preAnalysis = new PreAnalyzer().preAnalyze(parsed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
})); | ||
} | ||
|
||
private void preAnalyzeIndices( | ||
LogicalPlan parsed, | ||
List<TableInfo> indices, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some fly-by questions.
EnrichResolution enrichResolution | ||
) {} | ||
) { | ||
public AnalyzerContext( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for testing-only use? Would it be worth creating an instance for those purposes? Or a method as a testing util creating it? Or maybe leaving a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was for tests, but I did not want to follow the test approach to enrichResolution, which has three different ways of passing in an empty resolution, cluttering the tests. So this is an intermediate solution, which minimizes code changes (does not change test code), while getting the goal of the PR achieved. Then I would like a second PR that simplifies both this and the enrich resolution for tests (and cleans up test code all over the place).
if (plan.indexMode().equals(IndexMode.LOOKUP)) { | ||
return hackLookupMapping(plan); | ||
return resolveIndex(plan, context.lookupResolution()); | ||
} | ||
if (context.indexResolution().isValid() == false) { | ||
return plan.unresolvedMessage().equals(context.indexResolution().toString()) | ||
return resolveIndex(plan, context.indexResolution()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could be
return resolveIndex(plan, plan.indexMode().equals(IndexMode.LOOKUP) ? context.lookupResolution() : context.indexResolution());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I've made that simplification. My reason for the original version was I thought it looked more like the old code, so might be easier to review.
lx.delegateFailureAndWrap((ll, indexResolution) -> { | ||
// TODO in follow-PR (for skip_unavailble handling of missing concrete indexes) add some tests for invalid | ||
// index resolution to updateExecutionInfo | ||
if (indexResolution.isValid()) { | ||
EsqlSessionCCSUtils.updateExecutionInfoWithClustersWithNoMatchingIndices(executionInfo, indexResolution); | ||
EsqlSessionCCSUtils.updateExecutionInfoWithUnavailableClusters( | ||
executionInfo, | ||
indexResolution.unavailableClusters() | ||
); | ||
if (executionInfo.isCrossClusterSearch() | ||
&& executionInfo.getClusterStateCount(EsqlExecutionInfo.Cluster.Status.RUNNING) == 0) { | ||
// for a CCS, if all clusters have been marked as SKIPPED, nothing to search so send a sentinel | ||
// Exception to let the LogicalPlanActionListener decide how to proceed | ||
ll.onFailure(new NoClustersToSearchException()); | ||
return; | ||
} | ||
|
||
Set<String> newClusters = enrichPolicyResolver.groupIndicesPerCluster( | ||
indexResolution.get().concreteIndices().toArray(String[]::new) | ||
).keySet(); | ||
// If new clusters appear when resolving the main indices, we need to resolve the enrich policies again | ||
// or exclude main concrete indices. Since this is rare, it's simpler to resolve the enrich policies | ||
// again. | ||
// TODO: add a test for this | ||
if (targetClusters.containsAll(newClusters) == false | ||
// do not bother with a re-resolution if only remotes were requested and all were offline | ||
&& executionInfo.getClusterStateCount(EsqlExecutionInfo.Cluster.Status.RUNNING) > 0) { | ||
enrichPolicyResolver.resolvePolicies( | ||
newClusters, | ||
unresolvedPolicies, | ||
ll.map( | ||
newEnrichResolution -> action.apply(indexResolution, lookupIndexResolution, newEnrichResolution) | ||
) | ||
); | ||
return; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: wondering if this CCS logic could be extracted into a method for better legibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. But I think that should be done separately. I've tried to keep the CCS code as similar as possible to what was there before to reduce conflicts with the people working on CCS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I hope I could clean up the EsqlSession code a bit with #116755.
@@ -401,6 +424,25 @@ private void preAnalyzeIndices( | |||
} | |||
} | |||
|
|||
private void preAnalyzeLookupIndices(List<TableInfo> indices, Set<String> fieldNames, ActionListener<IndexResolution> listener) { | |||
if (indices.size() > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have this tracked somewhere? ENRICH maps policies to distinct resolutions and thus allows arbitrary enriching steps, we'll probably want to allow this at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the overall project plan has steps for improving this, and it requires changes in many more places. This PR focuses on keeping the same minimal test set working, while simply replacing the hard-coded index mappings with actual field-caps calls. I think a followup PR should deal with multiple index mappings (and multiple field-caps calls).
@@ -440,6 +483,11 @@ static Set<String> fieldNames(LogicalPlan parsed, Set<String> enrichPolicyMatchF | |||
// The exact name of the field will be added later as part of enrichPolicyMatchFields Set | |||
enrichRefs.removeIf(attr -> attr instanceof EmptyAttribute); | |||
references.addAll(enrichRefs); | |||
} else if (p instanceof LookupJoin join) { | |||
keepJoinReferences.addAll(join.config().matchFields()); // TODO: why is this empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to have a look: the LookupJoin is created with empty join fields, and the join fields are extracted from the UsingJoinType
only later in the optimiser (when surrogate()
is invoked) -- I suppose using the empty fields needs fixing (at some point before or after this PR?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this PR, hence the TODO to remind us to fix this. I view this work as a first minimal fix to the hard-coded field mappings, and want it to be clean, simple and easy to review.
Some tests might help -- using an inexistent join field isn't caught by the verifier (but still caught, though too late, in |
Indeed, more tests is probably good. I'd personally use the yaml stuff to build different lookup indices scenarios and try them. |
I guess you'd be the first to make YAML tests for this. |
I had hoped that this PR would simply fix the hard-coded mapping, making sure the existing csv-spec tests still worked. Increasing the tests will no-doubt expose quite a few serious limitations in the current stack for LOOKUP JOIN, which I assumed would be for a followup PR to tackle. If we do that in this PR, then we should do it because we do want this PR to be a much broader PR. My vote is lets make 2 PRs, this one does only what it claims in the description, and a followup that broadens the scope of the LOOKUP JOIN to handle multiple joins, and a broader suite of test cases. I can start on that directly, but would like this merged first. |
That's fine with me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* LOOKUP JOIN using field-caps for field mapping Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index. * Update docs/changelog/117246.yaml * Some code review comments
This is missing a backport to 8.x. |
The missing backport is blocking #116755 backport. |
* LOOKUP JOIN using field-caps for field mapping Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index. * Update docs/changelog/117246.yaml * Some code review comments
Looks like backport is coming in #117967 |
* LOOKUP JOIN using field-caps for field mapping (#117246) * LOOKUP JOIN using field-caps for field mapping Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index. * Update docs/changelog/117246.yaml * Some code review comments * Enhance LOOKUP JOIN csv-spec tests to cover more cases and fix several bugs found (#117843) Adds several more tests to lookup-join.csv-spec, and fixes the following bugs: * FieldCaps on right hand side should ignore fieldNames method and just use "*" because currently the fieldNames search cannot handle lookup fields with aliases (should be fixed in a followup PR). * Stop using the lookup index in the ComputeService (so we don’t get both indices data coming in from the left, and other weird behaviour). * Ignore failing SearchStats checks on fields from the right hand side in the logical planner (so it does not plan EVAL field = null for all right hand fields). This should be fixed properly with the correct updates to TransportSearchShardsAction (or rather to making multiple use of that for each branch of the execution model). * Don't load indices with mode:lookup due to cluster state errors in mixed clusters * Disable all lookup-join tests on 8.x, due to issues with cluster state * Spotless apply
Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.
This is based on the initial work done in #116515
For background look at the meta-task for
LOOKUP JOIN
at #116208.So basically think of it as a JOIN, and internally it is coded similarly to ENRICH. What we do for ENRICH, however, is get the field mappings by reading the enrich-policy. For the JOIN we have no such thing and need to get the field mappings from field_caps. When the query has a JOIN, there will be two
UnresolvedRelation
instances, one coming from theFROM mainindex
clause and supporting a bunch of advanced stuff like wildcards and CCS, and the second coming from theLOOKUP JOIN otherindex ON joinfield
call, and supporting much less (just single index name, no wildcards and no CCS). So we need to support three independent field mappings mechanisms:ENRICH
(asynchronously read the enrich-policy)LOOKUP JOIN
(asynchronously call field-caps on theotherindex
with potentially needed field names, the join field name, as well as whatever the query is expecting to see later in the results)FROM
(asynchronously call field-caps on themainindex
with potentially needed field names, from the query plan. The index name could have wildcards and CCS support)