Skip to content

_analyzer api inconsistency without index name #115930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

drempapis
Copy link
Contributor

@drempapis drempapis commented Oct 30, 2024

When an existing "index" is not defined in the _analyze call, e.g.

GET _analyze
{
  "analyzer": "standard", 
  "text": [ "a", "b"]
}

The target Analyzer is retrieved from the analysisRegistry, which returns a NamedAnalyzer from a prebuiltAnalysisProvider at which the positionIncrementGap will have the value Integer.MIN_VALUE, or a Lucene Analyzer object where the value is set to 0.

I was unsure where to update the value, so I experimented with overriding a NameAnalyzer on the TranspotAnalyzeAction side and defining the positionIncrementGap to the proper value.

When a Lucene analyzer returns, the problem remains the same.

solves #29021

@drempapis drempapis added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v9.0.0 labels Oct 30, 2024
@drempapis drempapis self-assigned this Oct 30, 2024
@drempapis drempapis requested a review from javanna October 30, 2024 12:04
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@drempapis drempapis changed the title Add a quick fix to mitigate the issue and a test to prove functionality #29021 _analyzer api inconsistency without index name #29021 Oct 30, 2024
@drempapis drempapis changed the title _analyzer api inconsistency without index name #29021 _analyzer api inconsistency without index name Oct 30, 2024
Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments, thanks!

return analyzer;
return analyzer instanceof NamedAnalyzer
? new NamedAnalyzer((NamedAnalyzer)analyzer, TextFieldMapper.Defaults.POSITION_INCREMENT_GAP)
: analyzer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix seems to work, but I wonder if it's a bit fragile. What is the underlying difference between the case with an index and that without an index? Is there a way to unify the too somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @javanna for the feedback.

When reaching the TransportAnalyzeAction class, the method getAnalyzer retrieves the target Analyzer. When the index is provided, the indexService is called, which has been bootstrapped during the AnalysisRegistry::build::produceAnalyzer, overriding the gap to 100.

Without an index in the rest call, the AnalysisRegistry returns the analyzer. In that case, the returned object can be of type NamedAnalyzer (the gap is overridden to Integer.MIN_VALUE) or a plain Lucene Analyzer object (gap defaults to 0). For both cases, no method is available to override the value of the gap.

I identified three ways to solve it so far.

  • The first was to override the NameAnalyzer after retrieving it from the AnalysisRegistry. This applies only to the objects of type NamedAnalyzer
  • The second could be to hardcode the incrementGap into the calculation method TransportAnalyzeAction::simpleAnalyze to add 100 when detects, e.g., a negative number
  • Another way would be to create NamedAnalyzer objects with the value 100 instead of Inger:Min.

I haven't yet considered unifying the logic. Further investigation is required for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the drive-by commenting, just because @drempapis and I talked about this earlier:

The second could be to hardcode the incrementGap into the calculation method

while that's possible I noted that such an approach would need to take care of both "simpleAnalyze()" and detailAnalyze()" branches, so two places that would need changing and consideration.
I would say the wrapping approach looks the most reasonable to me, looking at the code I only wondered when we'd ever be in a position there the non-index code branch would return a NamedAnalyzer from the registry? Do you have a specific case for that at hand? If that's not possible (and I think we only create NamedAnalyzers as part of index analyzers at the moment), then an even more light-weight version instead of wrapping with a NamedAnalyzer would be to wrap in a DelegatingAnalyzerWrapper and only overwrite "getPositionIncrementGap" with the ES default of 100. As far as I can tell the only way to modify this value is in text field mappings which in turn should only be there if we specify an index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. How difficult would it be to fake an IndexService when an index is not provided instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a fake IndexService is complex and requires many objects to mock.

Internally overrides the NamedAnalyzer object similarly to how we do it here. Is there any other way of creating an IndexService that I may have missed?

I experimented with the following approach of creating a dummyIndexService in the TransportAnalyzerAction class.

    private static IndexService craeteDummyIndexService(IndicesService indicesService) throws IOException {
        Index index = new Index(UUIDs.randomBase64UUID(), UUIDs.randomBase64UUID());
        Settings.Builder builder = Settings.builder()
            .put(IndexMetadata.SETTING_VERSION_CREATED, IndexVersion.current())
            .put(SETTING_INDEX_UUID, index.getUUID());

        IndexMetadata indexMetadata = new IndexMetadata.Builder(index.getName()).settings(builder.build())
            .numberOfShards(1)
            .numberOfReplicas(0)
            .build();
         return indicesService.createIndex(indexMetadata, Collections.emptyList(), false);
    }

This approach can return correctly for the non-index _analyzer api-calls but fails eventually with the exception.

[2024-11-05T10:45:21,605][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][clusterApplierService#updateTask][T#1]], exiting java.lang.AssertionError: index [ClMZwlqSQKSIeOZd-cgguA/D8sWGYUfSQu5vd9HQsVgvA] does not exist in the cluster state, it should either have been deleted or the cluster must be new
        at org.elasticsearch.server@9.0.0-SNAPSHOT/org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(IndicesClusterStateService.java:468)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for considering faking the index service. I agree that's not easy. I looked a bit deeper at AnalysisRegistry, and I see that it already has the logic for wrapping the NamedAnalyzer in produceAnalyzer. I wonder if we should leverage that somehow in the analyze api as well. I see that getAnalyzer is only used by the analyze api, and the categorize text agg. I even wonder if the latter may benefit from the same fix around position increment gap. So, should the fix be rather made in getAnalyzer, and perhaps the code shared between produceAnalyzer and getAnalyzer, to make sure they stay in sync?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @javanna. I added that logic to the API call defensively, but I'll also consider the option, verify how the categorize text agg works, and unify the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javanna I changed the getAnalyzer method as proposed, which works well for both the Analyze API and Categorize Text Agg. When an index is bootstrapped, the produceAnalyzer is called, requiring many objects to build and pass in the function call. Using the produceAnalyzer method when calling the getAnalyzer (in the case of no index provided) would result in more significant overhead and many intermediate classes to construct, which is unnecessary for this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not necessarily mean that we need to call produceAnalyzer, but we could share the wrapping part between the two methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I provided a method for both cases, wrapping the creation of a new object.

@javanna javanna added the :Search Relevance/Analysis How text is split into tokens label Oct 31, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 31, 2024
@javanna
Copy link
Member

javanna commented Oct 31, 2024

I marked this search relevance as well because it belongs to the analysis area, perhaps folks in that team will have opinions too.

…tead of hardcoding the expected response when an index is not provided, the output is compared to the output of the call when an index is present. We also iterate over all prebuilt analyzers available in the test class context.
@drempapis
Copy link
Contributor Author

@elasticmachine update branch

@drempapis
Copy link
Contributor Author

@elasticmachine update branch

} else {
analyzer = new NamedAnalyzer(name, analyzerFactory.scope(), analyzerF, overridePositionIncrementGap);
}
checkVersions(analyzer);
return analyzer;
}

private static NamedAnalyzer overrideAnalyzer(Analyzer analyzerF, int overridePositionIncrementGap) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can take a NamedAnalyzer already, given we cast outside of this method?

Can we also call it overridePositionIncrementGap which seems more specific to what the method does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was already set to NameAnalyzer the argument in the method. Maybe it needed to be refreshed.
Changed the name also to overridePositionIncrementGap.

@drempapis
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iterations, LGTM besides the two comments I left, no need for another review once those are addressed

} else {
analyzer = new NamedAnalyzer(name, analyzerFactory.scope(), analyzerF, overridePositionIncrementGap);
}
checkVersions(analyzer);
return analyzer;
}

private static NamedAnalyzer overridePositionIncrementGap(NamedAnalyzer analyzer, int overridePositionIncrementGap) {
if (overridePositionIncrementGap >= 0 && analyzer.getPositionIncrementGap(analyzer.name()) != overridePositionIncrementGap) {
// unless the positionIncrementGap needs to be overridden
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for nitpicking last minute, this comment is the second half of the previous one at line 728. Maybe they need to be either merged or both moved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to delete the comments. The code is self-descriptive and easy to follow and is being used in two places from different code paths

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ thanks

@@ -250,6 +251,32 @@ public void testFillsAttributes() throws IOException {
assertEquals("<ALPHANUM>", tokens.get(3).getType());
}

public void testMultiTextRequest() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind the name of this test? Is there a chance to make it self explanatory about what it tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the "request," there should be more than one text and no index name.

GET _analyze
{
  "analyzer": "standard", 
  "text": ["a", "b"]
}

It could be a better and more descriptive choice. I'll change it to'testAnalyzerWithTwoTextsAndNoIndexName` instead. Thank you, @javanna, for pointing it out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, perhaps using terms instead of text here would help :) the way I see it text is a set of terms, so it does not make a lot of sense to make it plural.

@drempapis drempapis added v8.17.0 auto-backport Automatically create backport pull requests when merged labels Nov 18, 2024
@drempapis drempapis merged commit 3253c2a into elastic:main Nov 18, 2024
16 checks passed
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Nov 18, 2024
Set positionIncrementGap to default (100) when no index is provided for the _analyze API
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
Set positionIncrementGap to default (100) when no index is provided for the _analyze API
Philippus added a commit to Philippus/elastic4s that referenced this pull request Mar 17, 2025
Philippus added a commit to Philippus/elastic4s that referenced this pull request May 7, 2025
* Update elasticsearch-rest-client, ... to 9.0.0-beta1

and docker images

* Fix test because of elastic/elasticsearch#115930

* Update elasticsearch-rest-client, ... to 9.0.0-rc1

and docker images

* Update elasticsearch-rest-client, ... to 9.0.0

and docker images
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations :Search Relevance/Analysis How text is split into tokens Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants