[ES|QL] Rerank operator improvements #132318

Merged

afoucret
Contributor

@afoucret afoucret commented Aug 1, 2025

This PR introduces several enhancements to ES|QL's RERANK command.

RERANK Input Validation:

  • The RERANK ... ON clause now validates that the provided fields are of a supported data type (string, numeric, or boolean). This prevents runtime errors when attempting to rerank on incompatible fields like dates or IPs.
  • If a single field is used, it is automatically cast to a string.
    With multiple fields, the whole content is encoded as YAML, so no cast is necessary.
  • Added validation tests in AnalyzerTests for supported / unsupported field types
  • Added test cases for non-textual fields in the CSV specs
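
The validation and cast rules above can be sketched in plain Java. This is a simplified model under stated assumptions, not the Elasticsearch implementation: the `FieldType` enum, the `rerankInput` helper, and the `publication_date` field name are hypothetical, and the list of types is a small subset of ES|QL's data types. Only the rule itself (string, numeric, and boolean are accepted; a single non-string field is wrapped in a string cast) comes from the description.

```java
public class RerankFieldValidationSketch {
    // Hypothetical, simplified stand-in for ES|QL data types.
    enum FieldType { KEYWORD, TEXT, INTEGER, DOUBLE, BOOLEAN, DATETIME, IP }

    // Per the PR description: string, numeric, and boolean fields are supported;
    // dates and IPs are rejected at analysis time.
    static boolean isValidRerankField(FieldType type) {
        return switch (type) {
            case KEYWORD, TEXT, INTEGER, DOUBLE, BOOLEAN -> true;
            case DATETIME, IP -> false;
        };
    }

    static boolean isString(FieldType type) {
        return type == FieldType.KEYWORD || type == FieldType.TEXT;
    }

    // Hypothetical helper: returns the expression that would feed the reranker
    // when a single field is used in the ON clause.
    static String rerankInput(String name, FieldType type) {
        if (isValidRerankField(type) == false) {
            throw new IllegalArgumentException("unsupported type for RERANK field [" + name + "]: " + type);
        }
        // Non-string fields are cast to string; string fields pass through unchanged.
        return isString(type) ? name : "TO_STRING(" + name + ")";
    }

    public static void main(String[] args) {
        System.out.println(rerankInput("title", FieldType.TEXT));
        System.out.println(rerankInput("ratings", FieldType.DOUBLE));
        try {
            rerankInput("publication_date", FieldType.DATETIME);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting unsupported types up front turns what would be a failure during execution into an error reported when the query is analyzed.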

Sparse Data Handling:

  • Enhanced the RERANK operator to correctly handle null or missing values in the input field
  • When the input value to be reranked is null, the operator returns null (0 does not make sense in the context of a reranker model, since the minimum score can be < 0).
  • Also improved the XContentRowEncoder (in charge of the YAML conversion when multiple fields are used) so that it returns null if all fields are null (previously it returned an empty YAML document)
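
The null-in, null-out behavior described above can be illustrated with a small sketch. It assumes a hypothetical `scorer` function standing in for the inference call to the reranker model; the class and method names are not from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SparseRerankSketch {
    // Hypothetical scoring loop: `scorer` stands in for the reranker model call.
    static List<Double> rerank(List<String> inputs, Function<String, Double> scorer) {
        List<Double> scores = new ArrayList<>(inputs.size());
        for (String input : inputs) {
            // A null input yields a null score rather than 0, because model
            // scores can be negative and 0 would not be a neutral value.
            scores.add(input == null ? null : scorer.apply(input));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("The Lord of the Rings", null, "");
        // Toy scorer that can return negative values, to show why 0 is ambiguous.
        System.out.println(rerank(inputs, s -> s.length() - 10.0));
    }
}
```

With the toy scorer, the empty string receives a negative score while the null row stays null, so a missing input is distinguishable from a poorly scored one.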

Bug Fixes & Testing:

  • Fixed a bug in XContentRowEncoder that caused a leading space in the output
  • A new, comprehensive unit test (XContentRowEncoderTests) has been added to cover the functionality of the XContentRowEncoder and prevent future regressions
  • Existing tests for RERANK and COMPLETION have been updated to use a new test helper for reading block data and to assert correct behavior with sparse inputs.
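
The two encoder properties mentioned in this PR (no leading whitespace, and null when every field is null) can be expressed as a sketch. This is not the real XContentRowEncoder: `encodeRow` is a hypothetical helper that renders simple `name: value` lines without YAML quoting or escaping, and skipping individual null fields is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class RowEncoderSketch {
    // Hypothetical encoder: renders one row as "name: value" lines.
    // Assumption: null fields are omitted; real YAML quoting is not handled.
    static String encodeRow(Map<String, Object> row) {
        StringJoiner out = new StringJoiner("\n");
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (field.getValue() != null) {
                out.add(field.getKey() + ": " + field.getValue());
            }
        }
        // Return null (not an empty document) when every field was null.
        return out.length() == 0 ? null : out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("title", "Return of the King");
        row.put("year", 1955);
        row.put("collection", null);
        System.out.println(encodeRow(row));

        Map<String, Object> empty = new LinkedHashMap<>();
        empty.put("collection", null);
        System.out.println(encodeRow(empty));
    }
}
```

A regression test in this style would assert that the encoded string does not start with whitespace and that an all-null row encodes to null.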

@afoucret afoucret added the >non-issue, v9.2.0, and :Search Relevance/ES|QL (Search functionality in ES|QL) labels Aug 1, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label Aug 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@@ -25,6 +25,9 @@
},
"year": {
"type": "integer"
},
"collection": {

ℹ️ Added a new column to the dataset with sparse data, so we can test some sparse behavior.

;

book_no:keyword | title:text | author:text | collection:text | rerank_score:double | _score:double
2714 | Return of the King Being the Third Part of The Lord of the Rings | J. R. R. Tolkien | The Lord of the Rings | 0.04761905 | 8.56

@afoucret afoucret Aug 1, 2025

ℹ️ Testing that reranking returns null when the input field is null.

| KEEP book_no, title, ratings, _score
;

book_no:keyword | title:text | ratings:double | _score:double

ℹ️ It is stupid to rerank on a number but at least it does not break.

| KEEP book_no, title, ratings, _score
;

book_no:keyword | title:text | ratings:double | _score:double

ℹ️ Combining text and non-text fields. They are encoded as a YAML document that is passed to the reranker.

if (castRerankFieldsAsString
    && rerank.isValidRerankField(resolved)
    && DataType.isString(resolved.dataType()) == false) {
    resolved = resolved.replaceChild(new ToString(resolved.child().source(), resolved.child()));

ℹ️ Casting non-text input fields to string.

@tteofili tteofili left a comment

LGTM

@afoucret afoucret enabled auto-merge (squash) August 1, 2025 12:16
@afoucret afoucret disabled auto-merge August 1, 2025 13:56
@afoucret afoucret enabled auto-merge (squash) August 1, 2025 13:57
@afoucret afoucret merged commit 75ac0ee into elastic:main Aug 1, 2025
33 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 1, 2025
…cking

* upstream/main: (166 commits)
  Reduce inactive sink interval in VectorSimilarityFunctionsIT (elastic#132288)
  ESQL: Allow agg tests to process many columns (elastic#132358)
  Update analysis-lowercase-tokenfilter.md (elastic#132359)
  Add Sparse Vector Index Options Settings to Semantic Text Field (elastic#131058)
  Collect node thread pool usage for shard balancing (elastic#131480)
  Add tasks to validate new style transport versions (elastic#131782)
  Mute org.elasticsearch.search.routing.SearchReplicaSelectionIT testNodeSelection elastic#132354
  Mute org.elasticsearch.xpack.esql.action.CrossClusterAsyncQueryIT testBadAsyncId elastic#132353
  Fixes DenseVectorFieldIndexTypeUpdateIT release tests (elastic#132346)
  Fix testCloseOrReallocateDuringPartialSnapshot (elastic#132049)
  (Doc) ILM Force Merge not on HDD and happens on hosting node not current phase tier (elastic#130280)
  Run GeoIp YAML tests in multi-project cluster and fix bug discovered by tests (elastic#131521)
  Unmutes elastic#132111, seems a transient, non reproducible issue (elastic#132253)
  Mute org.elasticsearch.search.suggest.phrase.PhraseSuggesterIT testPhraseSuggestionWithNgramOnlyAnalyzerThrowsException elastic#132347
  Add AI21 support to Inference Plugin (elastic#131238)
  OpenJDK EA builds should use https instead of http (elastic#132297)
  ESQL: Normalize timeseries aggs slightly (elastic#132284)
  Avoid internal server error on suggester ngram bad request (elastic#132321)
  [ES|QL] Rerank operator improvements (elastic#132318)
  Mute org.elasticsearch.xpack.logsdb.qa.LogsDbVersusReindexedLogsDbChallengeRestIT testTermsQuery elastic#132337
  ...
Labels
>non-issue :Search Relevance/ES|QL Search functionality in ES|QL Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0
4 participants