_analyzer api inconsistency without index name #115930

drempapis · 2024-10-30T12:03:49Z

When an existing "index" is not defined in the _analyze call, e.g.

GET _analyze
{
  "analyzer": "standard", 
  "text": [ "a", "b"]
}

The target Analyzer is retrieved from the analysisRegistry, which returns a NamedAnalyzer from a prebuiltAnalysisProvider at which the positionIncrementGap will have the value Integer.MIN_VALUE, or a Lucene Analyzer object where the value is set to 0.

I was unsure where to update the value, so I experimented with overriding a NameAnalyzer on the TranspotAnalyzeAction side and defining the positionIncrementGap to the proper value.

When a Lucene analyzer returns, the problem remains the same.

solves #29021

elasticsearchmachine · 2024-10-30T12:04:12Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

javanna

I left a couple of comments, thanks!

server/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java

javanna · 2024-10-31T10:17:55Z

server/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

-                return analyzer;
+                return analyzer instanceof NamedAnalyzer
+                       ? new NamedAnalyzer((NamedAnalyzer)analyzer, TextFieldMapper.Defaults.POSITION_INCREMENT_GAP)
+                       : analyzer;


This fix seems to work, but I wonder if it's a bit fragile. What is the underlying difference between the case with an index and that without an index? Is there a way to unify the too somehow?

Thanks @javanna for the feedback.

When reaching the TransportAnalyzeAction class, the method getAnalyzer retrieves the target Analyzer. When the index is provided, the indexService is called, which has been bootstrapped during the AnalysisRegistry::build::produceAnalyzer, overriding the gap to 100.

Without an index in the rest call, the AnalysisRegistry returns the analyzer. In that case, the returned object can be of type NamedAnalyzer (the gap is overridden to Integer.MIN_VALUE) or a plain Lucene Analyzer object (gap defaults to 0). For both cases, no method is available to override the value of the gap.

I identified three ways to solve it so far.

The first was to override the NameAnalyzer after retrieving it from the AnalysisRegistry. This applies only to the objects of type NamedAnalyzer

The second could be to hardcode the incrementGap into the calculation method TransportAnalyzeAction::simpleAnalyze to add 100 when detects, e.g., a negative number

Another way would be to create NamedAnalyzer objects with the value 100 instead of Inger:Min.

I haven't yet considered unifying the logic. Further investigation is required for that.

Sorry for the drive-by commenting, just because @drempapis and I talked about this earlier:

The second could be to hardcode the incrementGap into the calculation method

while that's possible I noted that such an approach would need to take care of both "simpleAnalyze()" and detailAnalyze()" branches, so two places that would need changing and consideration.
I would say the wrapping approach looks the most reasonable to me, looking at the code I only wondered when we'd ever be in a position there the non-index code branch would return a NamedAnalyzer from the registry? Do you have a specific case for that at hand? If that's not possible (and I think we only create NamedAnalyzers as part of index analyzers at the moment), then an even more light-weight version instead of wrapping with a NamedAnalyzer would be to wrap in a DelegatingAnalyzerWrapper and only overwrite "getPositionIncrementGap" with the ES default of 100. As far as I can tell the only way to modify this value is in text field mappings which in turn should only be there if we specify an index.

Thanks for the explanation. How difficult would it be to fake an IndexService when an index is not provided instead?

Creating a fake IndexService is complex and requires many objects to mock.

elasticsearch/server/src/main/java/org/elasticsearch/index/IndexService.java

Line 167 in 8afe2f4

public IndexService(

Internally overrides the NamedAnalyzer object similarly to how we do it here. Is there any other way of creating an IndexService that I may have missed?

elasticsearch/server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

Line 723 in 8afe2f4

if (analyzerF instanceof NamedAnalyzer) {

I experimented with the following approach of creating a dummyIndexService in the TransportAnalyzerAction class.

private static IndexService craeteDummyIndexService(IndicesService indicesService) throws IOException { Index index = new Index(UUIDs.randomBase64UUID(), UUIDs.randomBase64UUID()); Settings.Builder builder = Settings.builder() .put(IndexMetadata.SETTING_VERSION_CREATED, IndexVersion.current()) .put(SETTING_INDEX_UUID, index.getUUID()); IndexMetadata indexMetadata = new IndexMetadata.Builder(index.getName()).settings(builder.build()) .numberOfShards(1) .numberOfReplicas(0) .build(); return indicesService.createIndex(indexMetadata, Collections.emptyList(), false); }

This approach can return correctly for the non-index _analyzer api-calls but fails eventually with the exception.

[2024-11-05T10:45:21,605][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][clusterApplierService#updateTask][T#1]], exiting java.lang.AssertionError: index [ClMZwlqSQKSIeOZd-cgguA/D8sWGYUfSQu5vd9HQsVgvA] does not exist in the cluster state, it should either have been deleted or the cluster must be new at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(IndicesClusterStateService.java:468)

Thanks for considering faking the index service. I agree that's not easy. I looked a bit deeper at AnalysisRegistry, and I see that it already has the logic for wrapping the NamedAnalyzer in produceAnalyzer. I wonder if we should leverage that somehow in the analyze api as well. I see that getAnalyzer is only used by the analyze api, and the categorize text agg. I even wonder if the latter may benefit from the same fix around position increment gap. So, should the fix be rather made in getAnalyzer, and perhaps the code shared between produceAnalyzer and getAnalyzer, to make sure they stay in sync?

Thank you, @javanna. I added that logic to the API call defensively, but I'll also consider the option, verify how the categorize text agg works, and unify the logic.

@javanna I changed the getAnalyzer method as proposed, which works well for both the Analyze API and Categorize Text Agg. When an index is bootstrapped, the produceAnalyzer is called, requiring many objects to build and pass in the function call. Using the produceAnalyzer method when calling the getAnalyzer (in the case of no index provided) would result in more significant overhead and many intermediate classes to construct, which is unnecessary for this case.

I did not necessarily mean that we need to call produceAnalyzer, but we could share the wrapping part between the two methods?

I provided a method for both cases, wrapping the creation of a new object.

elasticsearchmachine · 2024-10-31T10:18:46Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

javanna · 2024-10-31T10:18:59Z

I marked this search relevance as well because it belongs to the analysis area, perhaps folks in that team will have opinions too.

server/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

server/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java

…tead of hardcoding the expected response when an index is not provided, the output is compared to the output of the call when an index is present. We also iterate over all prebuilt analyzers available in the test class context.

drempapis · 2024-11-08T07:21:27Z

@elasticmachine update branch

drempapis · 2024-11-13T10:57:55Z

@elasticmachine update branch

javanna · 2024-11-13T15:29:49Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

        } else {
            analyzer = new NamedAnalyzer(name, analyzerFactory.scope(), analyzerF, overridePositionIncrementGap);
        }
        checkVersions(analyzer);
        return analyzer;
    }

+    private static NamedAnalyzer overrideAnalyzer(Analyzer analyzerF, int overridePositionIncrementGap) {


We can take a NamedAnalyzer already, given we cast outside of this method?

Can we also call it overridePositionIncrementGap which seems more specific to what the method does?

It was already set to NameAnalyzer the argument in the method. Maybe it needed to be refreshed.
Changed the name also to overridePositionIncrementGap.

drempapis · 2024-11-14T08:18:07Z

@elasticmachine update branch

javanna

Thanks for the iterations, LGTM besides the two comments I left, no need for another review once those are addressed

javanna · 2024-11-14T15:57:08Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

        } else {
            analyzer = new NamedAnalyzer(name, analyzerFactory.scope(), analyzerF, overridePositionIncrementGap);
        }
        checkVersions(analyzer);
        return analyzer;
    }

+    private static NamedAnalyzer overridePositionIncrementGap(NamedAnalyzer analyzer, int overridePositionIncrementGap) {
+        if (overridePositionIncrementGap >= 0 && analyzer.getPositionIncrementGap(analyzer.name()) != overridePositionIncrementGap) {
+            // unless the positionIncrementGap needs to be overridden


sorry for nitpicking last minute, this comment is the second half of the previous one at line 728. Maybe they need to be either merged or both moved?

I decided to delete the comments. The code is self-descriptive and easy to follow and is being used in two places from different code paths

javanna · 2024-11-14T16:01:40Z

server/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java

@@ -250,6 +251,32 @@ public void testFillsAttributes() throws IOException {
        assertEquals("<ALPHANUM>", tokens.get(3).getType());
    }

+    public void testMultiTextRequest() throws IOException {


What is the rationale behind the name of this test? Is there a chance to make it self explanatory about what it tests?

In the "request," there should be more than one text and no index name.

GET _analyze { "analyzer": "standard", "text": ["a", "b"] }

It could be a better and more descriptive choice. I'll change it to'testAnalyzerWithTwoTextsAndNoIndexName` instead. Thank you, @javanna, for pointing it out.

I see, perhaps using terms instead of text here would help :) the way I see it text is a set of terms, so it does not make a lot of sense to make it plural.

Set positionIncrementGap to default (100) when no index is provided for the _analyze API

* Update elasticsearch-rest-client, ... to 9.0.0-beta1 and docker images * Fix test because of elastic/elasticsearch#115930 * Update elasticsearch-rest-client, ... to 9.0.0-rc1 and docker images * Update elasticsearch-rest-client, ... to 9.0.0 and docker images

Add a quick fix to mitigate the issue and a test to prove functionallity

c9173f4

drempapis added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v9.0.0 labels Oct 30, 2024

drempapis self-assigned this Oct 30, 2024

drempapis requested a review from javanna October 30, 2024 12:04

drempapis changed the title ~~Add a quick fix to mitigate the issue and a test to prove functionality #29021~~ _analyzer api inconsistency without index name #29021 Oct 30, 2024

drempapis changed the title ~~_analyzer api inconsistency without index name #29021~~ _analyzer api inconsistency without index name Oct 30, 2024

Added changelog

928406f

javanna reviewed Oct 31, 2024

View reviewed changes

javanna added the :Search Relevance/Analysis How text is split into tokens label Oct 31, 2024

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 31, 2024

Fix styling issue in the code

8db66d7

pmpailis reviewed Nov 1, 2024

View reviewed changes

server/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java Outdated Show resolved Hide resolved

pmpailis reviewed Nov 1, 2024

View reviewed changes

server/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java Outdated Show resolved Hide resolved

drempapis added 3 commits November 4, 2024 16:13

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

f7ad18f

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

4d9f478

elasticmachine and others added 5 commits November 8, 2024 08:21

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

c1df93f

Changed the scope of the applied code into AnalysisRegistry class

c77ff1f

Add a method for overriden NamedAnlyzer construction

aba5268

Update method signature

ae5d89a

Add java 17+ feature

7615323

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

cd2f5db

javanna reviewed Nov 13, 2024

View reviewed changes

Update method name after review

07c2631

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

9bfe054

javanna approved these changes Nov 14, 2024

View reviewed changes

drempapis added 2 commits November 18, 2024 09:51

Merge branch 'main' into analyzer_api_inconsistency_without_index_name

2a47704

update code snippets after review

d55c634

drempapis added v8.17.0 auto-backport Automatically create backport pull requests when merged labels Nov 18, 2024

drempapis merged commit 3253c2a into elastic:main Nov 18, 2024
16 checks passed

salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Nov 18, 2024

_analyzer api inconsistency without index name (elastic#115930)

ae03f4d

Set positionIncrementGap to default (100) when no index is provided for the _analyze API

alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024

_analyzer api inconsistency without index name (elastic#115930)

25cadca

Set positionIncrementGap to default (100) when no index is provided for the _analyze API

Philippus added a commit to Philippus/elastic4s that referenced this pull request Mar 17, 2025

Fix test because of elastic/elasticsearch#115930

10c8aa2

_analyzer api inconsistency without index name #115930

_analyzer api inconsistency without index name #115930

Uh oh!

Conversation

drempapis commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Oct 30, 2024

Uh oh!

javanna left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Oct 31, 2024

Uh oh!

javanna commented Oct 31, 2024

Uh oh!

Uh oh!

Uh oh!

drempapis commented Nov 8, 2024

Uh oh!

drempapis commented Nov 13, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drempapis commented Nov 14, 2024

Uh oh!

javanna left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

drempapis commented Oct 30, 2024 •

edited

Loading