LOOKUP JOIN using field-caps for field mapping #117246

craigtaverner · 2024-11-21T12:54:59Z

Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.

This is based on the initial work done in #116515

For background look at the meta-task for LOOKUP JOIN at #116208.

LOOKUP JOIN behaves like a JOIN, but internally it is coded similarly to ENRICH. For ENRICH we get the field mappings by reading the enrich-policy. For the JOIN we have no such thing, so the field mappings must come from field_caps. When the query has a JOIN, there will be two UnresolvedRelation instances: one coming from the FROM mainindex clause, which supports advanced features like wildcards and CCS, and a second coming from the LOOKUP JOIN otherindex ON joinfield clause, which supports much less (just a single index name, no wildcards and no CCS). So we need to support three independent field-mapping mechanisms:

ENRICH (asynchronously read the enrich-policy)
LOOKUP JOIN (asynchronously call field-caps on the otherindex with potentially needed field names, the join field name, as well as whatever the query is expecting to see later in the results)
FROM (asynchronously call field-caps on the mainindex with potentially needed field names, from the query plan. The index name could have wildcards and CCS support)
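
A minimal sketch of the three independent, asynchronous resolutions listed above. Everything here is a hypothetical stand-in (`Mapping`, `readEnrichPolicy`, `fieldCaps`, the policy name and the field names), not the real ES|QL classes or APIs; it only illustrates that each source is resolved on its own and the results are combined once all three have completed.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

final class FieldMappingSketch {
    // Hypothetical stand-in for a resolved field mapping; not an Elasticsearch class.
    record Mapping(String source, String target, Set<String> fields) {}

    // ENRICH: mappings are read from the enrich policy (stubbed here).
    static CompletableFuture<Mapping> readEnrichPolicy(String policy) {
        return CompletableFuture.supplyAsync(() -> new Mapping("enrich-policy", policy, Set.of("language_name")));
    }

    // FROM and LOOKUP JOIN: mappings come from a field-caps call (stubbed here).
    static CompletableFuture<Mapping> fieldCaps(String indexPattern, Set<String> fieldNames) {
        return CompletableFuture.supplyAsync(() -> new Mapping("field-caps", indexPattern, fieldNames));
    }

    // Start all three resolutions independently, then combine them once all are done.
    static List<Mapping> resolveAll() {
        CompletableFuture<Mapping> enrich = readEnrichPolicy("languages_policy");
        // The main index pattern may use wildcards and CCS syntax.
        CompletableFuture<Mapping> mainIndex = fieldCaps("remote:main-*", Set.of("emp_no", "language_code"));
        // The lookup index is a single concrete name: no wildcards, no CCS.
        CompletableFuture<Mapping> lookup = fieldCaps("languages_lookup", Set.of("language_code", "language_name"));
        return CompletableFuture.allOf(enrich, mainIndex, lookup)
            .thenApply(ignored -> List.of(enrich.join(), mainIndex.join(), lookup.join()))
            .join();
    }

    public static void main(String[] args) {
        resolveAll().forEach(System.out::println);
    }
}
```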


elasticsearchmachine · 2024-11-21T12:55:23Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-11-21T12:55:47Z

Hi @craigtaverner, I've created a changelog YAML for you.

nik9000

LGTM. Wants a review from @astefan because he's refactoring this code and can probably answer questions like "TODO: why is this empty". If we can't get through all of the outstanding TODOs but @astefan is still happy let's get this merged because it's way better than my hack. We should just make sure to lodge the TODOs into the meta issue.

nik9000 · 2024-11-21T14:08:43Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

    ) {
-        PreAnalyzer.PreAnalysis preAnalysis = new PreAnalyzer().preAnalyze(parsed);


nik9000 · 2024-11-21T14:08:45Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

        }));
    }

    private void preAnalyzeIndices(
-        LogicalPlan parsed,
+        List<TableInfo> indices,


bpintea

Just some fly-by questions.

bpintea · 2024-11-21T16:21:30Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/AnalyzerContext.java

    EnrichResolution enrichResolution
-) {}
+) {
+    public AnalyzerContext(


Is this for testing-only use? Would it be worth creating an instance for those purposes? Or a method as a testing util creating it? Or maybe leaving a comment?

It was for tests, but I did not want to follow the approach the tests take for enrichResolution, which has three different ways of passing in an empty resolution and clutters the tests. So this is an intermediate solution that minimizes code changes (it does not change test code) while still achieving the goal of the PR. I would then like a second PR that simplifies both this and the enrich resolution for tests (and cleans up test code all over the place).
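
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a record with an extra convenience constructor. The types are simplified stand-ins, and the default chosen for the lookup resolution (an invalid placeholder) is an assumption for illustration; the diff hunk above does not show the constructor body.

```java
final class AnalyzerContextSketch {
    // Hypothetical stand-ins for the real resolution types.
    record IndexResolution(String index, boolean valid) {
        static IndexResolution invalid(String name) {
            return new IndexResolution(name, false);
        }
    }

    record EnrichResolution() {}

    record Context(IndexResolution indexResolution, IndexResolution lookupResolution, EnrichResolution enrichResolution) {
        // Convenience constructor for callers (such as tests) that never use a lookup index.
        // Assumption: the lookup resolution defaults to an invalid placeholder.
        Context(IndexResolution indexResolution, EnrichResolution enrichResolution) {
            this(indexResolution, IndexResolution.invalid("<none>"), enrichResolution);
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context(new IndexResolution("test", true), new EnrichResolution());
        System.out.println(ctx.lookupResolution().valid());
    }
}
```

Existing call sites that pass only the index and enrich resolutions keep compiling unchanged, which matches the stated aim of not touching test code.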

bpintea · 2024-11-21T16:26:32Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java

            if (plan.indexMode().equals(IndexMode.LOOKUP)) {
-                return hackLookupMapping(plan);
+                return resolveIndex(plan, context.lookupResolution());
            }
-            if (context.indexResolution().isValid() == false) {
-                return plan.unresolvedMessage().equals(context.indexResolution().toString())
+            return resolveIndex(plan, context.indexResolution());


Nit: could be
return resolveIndex(plan, plan.indexMode().equals(IndexMode.LOOKUP) ? context.lookupResolution() : context.indexResolution());

OK. I've made that simplification. My reason for the original version was I thought it looked more like the old code, so might be easier to review.
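
The two shapes discussed in this thread are behaviorally equivalent. A small self-contained sketch, using hypothetical stand-in types (`Plan`, `Context`, and string-valued resolutions) instead of the real analyzer classes:

```java
final class ResolveIndexSketch {
    enum IndexMode { STANDARD, LOOKUP }

    // Hypothetical stand-ins for the unresolved relation and the analyzer context.
    record Plan(String table, IndexMode indexMode) {}

    record Context(String indexResolution, String lookupResolution) {}

    static String resolveIndex(Plan plan, String resolution) {
        return plan.table() + " -> " + resolution;
    }

    // Original shape: early return for lookup indices, mirroring the old code.
    static String resolveBranching(Plan plan, Context context) {
        if (plan.indexMode().equals(IndexMode.LOOKUP)) {
            return resolveIndex(plan, context.lookupResolution());
        }
        return resolveIndex(plan, context.indexResolution());
    }

    // Suggested shape: pick the resolution with a ternary, then make a single call.
    static String resolveTernary(Plan plan, Context context) {
        return resolveIndex(plan, plan.indexMode().equals(IndexMode.LOOKUP) ? context.lookupResolution() : context.indexResolution());
    }

    public static void main(String[] args) {
        Context context = new Context("main-mapping", "lookup-mapping");
        System.out.println(resolveTernary(new Plan("languages_lookup", IndexMode.LOOKUP), context));
        System.out.println(resolveTernary(new Plan("employees", IndexMode.STANDARD), context));
    }
}
```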

bpintea · 2024-11-21T17:03:17Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+                        lx.delegateFailureAndWrap((ll, indexResolution) -> {
+                            // TODO in follow-PR (for skip_unavailble handling of missing concrete indexes) add some tests for invalid
+                            // index resolution to updateExecutionInfo
+                            if (indexResolution.isValid()) {
+                                EsqlSessionCCSUtils.updateExecutionInfoWithClustersWithNoMatchingIndices(executionInfo, indexResolution);
+                                EsqlSessionCCSUtils.updateExecutionInfoWithUnavailableClusters(
+                                    executionInfo,
+                                    indexResolution.unavailableClusters()
+                                );
+                                if (executionInfo.isCrossClusterSearch()
+                                    && executionInfo.getClusterStateCount(EsqlExecutionInfo.Cluster.Status.RUNNING) == 0) {
+                                    // for a CCS, if all clusters have been marked as SKIPPED, nothing to search so send a sentinel
+                                    // Exception to let the LogicalPlanActionListener decide how to proceed
+                                    ll.onFailure(new NoClustersToSearchException());
+                                    return;
+                                }
+
+                                Set<String> newClusters = enrichPolicyResolver.groupIndicesPerCluster(
+                                    indexResolution.get().concreteIndices().toArray(String[]::new)
+                                ).keySet();
+                                // If new clusters appear when resolving the main indices, we need to resolve the enrich policies again
+                                // or exclude main concrete indices. Since this is rare, it's simpler to resolve the enrich policies
+                                // again.
+                                // TODO: add a test for this
+                                if (targetClusters.containsAll(newClusters) == false
+                                    // do not bother with a re-resolution if only remotes were requested and all were offline
+                                    && executionInfo.getClusterStateCount(EsqlExecutionInfo.Cluster.Status.RUNNING) > 0) {
+                                    enrichPolicyResolver.resolvePolicies(
+                                        newClusters,
+                                        unresolvedPolicies,
+                                        ll.map(
+                                            newEnrichResolution -> action.apply(indexResolution, lookupIndexResolution, newEnrichResolution)
+                                        )
+                                    );
+                                    return;
+                                }
+                            }


Nit: wondering if this CCS logic could be extracted into a method for better legibility.

Agreed. But I think that should be done separately. I've tried to keep the CCS code as similar as possible to what was there before to reduce conflicts with the people working on CCS.

Indeed, I hope I can clean up the EsqlSession code a bit with #116755.

bpintea · 2024-11-21T17:19:19Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

@@ -401,6 +424,25 @@ private void preAnalyzeIndices(
        }
    }

+    private void preAnalyzeLookupIndices(List<TableInfo> indices, Set<String> fieldNames, ActionListener<IndexResolution> listener) {
+        if (indices.size() > 1) {


Do we have this tracked somewhere? ENRICH maps policies to distinct resolutions and thus allows arbitrary enriching steps, we'll probably want to allow this at some point.

I believe the overall project plan has steps for improving this, and it requires changes in many more places. This PR focuses on keeping the same minimal test set working, while simply replacing the hard-coded index mappings with actual field-caps calls. I think a followup PR should deal with multiple index mappings (and multiple field-caps calls).
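
A rough sketch of the single-lookup-index restriction discussed here. Only the method name, its three parameters, and the `indices.size() > 1` check come from the diff hunk; the listener interface, the error message, and the behavior of the other branches are assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;

final class LookupPreAnalysisSketch {
    // Simplified, hypothetical listener; the real code uses ActionListener<IndexResolution>.
    interface Listener {
        void onResponse(String resolution);

        void onFailure(Exception e);
    }

    // Records whichever callback was invoked, so the outcome can be inspected.
    static final class Recorder implements Listener {
        String response;
        Exception failure;

        public void onResponse(String resolution) {
            response = resolution;
        }

        public void onFailure(Exception e) {
            failure = e;
        }
    }

    static void preAnalyzeLookupIndices(List<String> indices, Set<String> fieldNames, Listener listener) {
        if (indices.size() > 1) {
            // Assumed behavior: reject queries with more than one lookup index for now.
            listener.onFailure(new IllegalArgumentException("more than one lookup index is not supported yet"));
        } else if (indices.size() == 1) {
            // Stand-in for a single field-caps call against the one lookup index.
            listener.onResponse("field-caps(" + indices.get(0) + ", fields=" + fieldNames.size() + ")");
        } else {
            // No LOOKUP JOIN in the query: nothing to resolve.
            listener.onResponse("<no lookup index>");
        }
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        preAnalyzeLookupIndices(List.of("languages_lookup"), Set.of("language_code"), recorder);
        System.out.println(recorder.response);
    }
}
```

Supporting several lookup indices would mean one resolution per index, much as ENRICH maps each policy to its own resolution.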

bpintea · 2024-11-21T18:18:16Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

@@ -440,6 +483,11 @@ static Set<String> fieldNames(LogicalPlan parsed, Set<String> enrichPolicyMatchF
                // The exact name of the field will be added later as part of enrichPolicyMatchFields Set
                enrichRefs.removeIf(attr -> attr instanceof EmptyAttribute);
                references.addAll(enrichRefs);
+            } else if (p instanceof LookupJoin join) {
+                keepJoinReferences.addAll(join.config().matchFields());  // TODO: why is this empty


Had to have a look: the LookupJoin is created with empty join fields, and the join fields are extracted from the UsingJoinType only later in the optimiser (when surrogate() is invoked) -- I suppose using the empty fields needs fixing (at some point before or after this PR?).

After this PR, hence the TODO to remind us to fix this. I view this work as a first minimal fix to the hard-coded field mappings, and want it to be clean, simple and easy to review.

bpintea · 2024-11-21T18:20:06Z

Some tests might help -- using a nonexistent join field isn't caught by the verifier (it is still caught, though too late, in LogicalVerifier).

nik9000 · 2024-11-21T20:16:02Z

Indeed, more tests is probably good. I'd personally use the yaml stuff to build different lookup indices scenarios and try them.

nik9000 · 2024-11-21T20:16:13Z

I guess you'd be the first to make YAML tests for this.


craigtaverner · 2024-11-22T12:56:40Z

Indeed, more tests is probably good. I'd personally use the yaml stuff to build different lookup indices scenarios and try them.

I had hoped that this PR would simply fix the hard-coded mapping, making sure the existing csv-spec tests still worked. Increasing the tests will no doubt expose quite a few serious limitations in the current stack for LOOKUP JOIN, which I assumed a followup PR would tackle. If we do that in this PR, it should be because we want this PR to be much broader. My vote is to make two PRs: this one does only what it claims in the description, and a followup broadens the scope of LOOKUP JOIN to handle multiple joins and adds a broader suite of test cases. I can start on that directly, but would like this merged first.

nik9000 · 2024-11-22T14:14:37Z

Increasing the tests will no doubt expose quite a few serious limitations in the current stack for LOOKUP JOIN

That's fine with me.

astefan

LGTM 👍

costin

👍

* LOOKUP JOIN using field-caps for field mapping Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.
* Update docs/changelog/117246.yaml
* Some code review comments

alex-spies · 2024-12-03T18:50:15Z

This is missing a backport to 8.x.

astefan · 2024-12-04T08:09:02Z

The missing backport is blocking #116755 backport.


nik9000 · 2024-12-04T20:32:06Z

Looks like backport is coming in #117967

* LOOKUP JOIN using field-caps for field mapping (#117246)
* LOOKUP JOIN using field-caps for field mapping Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.
* Update docs/changelog/117246.yaml
* Some code review comments
* Enhance LOOKUP JOIN csv-spec tests to cover more cases and fix several bugs found (#117843) Adds several more tests to lookup-join.csv-spec, and fixes the following bugs:
* FieldCaps on right hand side should ignore fieldNames method and just use "*" because currently the fieldNames search cannot handle lookup fields with aliases (should be fixed in a followup PR).
* Stop using the lookup index in the ComputeService (so we don’t get both indices data coming in from the left, and other weird behaviour).
* Ignore failing SearchStats checks on fields from the right hand side in the logical planner (so it does not plan EVAL field = null for all right hand fields). This should be fixed properly with the correct updates to TransportSearchShardsAction (or rather to making multiple use of that for each branch of the execution model).
* Don't load indices with mode:lookup due to cluster state errors in mixed clusters
* Disable all lookup-join tests on 8.x, due to issues with cluster state
* Spotless apply

LOOKUP JOIN using field-caps for field mapping

5823608

Removes the hard-coded hack for languages_lookup, and instead does a field-caps check for the real join index.

craigtaverner added the >enhancement, Team:Analytics, and :Analytics/ES|QL labels Nov 21, 2024

craigtaverner requested review from costin and alex-spies November 21, 2024 12:54

elasticsearchmachine added the v9.0.0 label Nov 21, 2024

craigtaverner added 2 commits November 21, 2024 13:55

Merge remote-tracking branch 'origin/main' into lookup_join_field_caps

934a427

Update docs/changelog/117246.yaml

5815db7

craigtaverner mentioned this pull request Nov 21, 2024

ESQL: Lookup Join meta issue #116208

Closed

48 tasks

nik9000 approved these changes Nov 21, 2024


bpintea reviewed Nov 21, 2024


alex-spies requested a review from astefan November 22, 2024 10:00

craigtaverner added 3 commits November 22, 2024 11:35

Merge branch 'lookup_join_field_caps' of github.com:craigtaverner/ela…

798bf79

…sticsearch into lookup_join_field_caps

Merge remote-tracking branch 'origin/main' into lookup_join_field_caps

6499e25

Some code review comments

beeccfc

Merge remote-tracking branch 'origin/main' into lookup_join_field_caps

5d10d0a

astefan approved these changes Nov 22, 2024


costin approved these changes Nov 25, 2024


craigtaverner merged commit 32aaacb into elastic:main Nov 25, 2024
16 checks passed

craigtaverner mentioned this pull request Dec 4, 2024

[8.x] Backport two PRs (#117246) (#117843) #117967

Merged

alex-spies added the v8.18.0 label Dec 4, 2024
