ESQL: Push more ==s on text fields to lucene (backport) #128156

nik9000 · 2025-05-19T18:34:00Z

If you do:

| WHERE text_field == "cat"

we can't push to the text field because its search index is for
individual words. But most text fields have a .keyword subfield and
we can query its index. EXCEPT! It's normal for these fields to have
ignore_above in their mapping. In that case we don't push to the
field. Very sad.

With this change we can push down ==, but only when the right hand
side is shorter than the ignore_above.
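
To make the rule concrete, here is a minimal shell sketch. It is not the actual Elasticsearch implementation. The field name `text_field`, the `ignore_above` value of 256, and the helper names `mapping_body` and `can_push_eq` are all hypothetical. It also assumes a literal is "short enough" when its character count is at most `ignore_above`; the exact boundary the planner uses is not stated in this description.

```shell
#!/usr/bin/env bash
# Hypothetical mapping: a text field with a .keyword subfield capped by ignore_above.
mapping_body() {
    cat <<'EOF'
{
    "mappings": {
        "properties": {
            "text_field": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}
EOF
}

# Sketch of the check: succeed (exit 0) when the right hand side is short
# enough that the .keyword subfield would have indexed it.
# Assumption: "short enough" means length <= ignore_above.
can_push_eq() {
    local literal="$1" ignore_above="$2"
    [ "${#literal}" -le "$ignore_above" ]
}

if can_push_eq "cat" 256; then
    echo "push == to text_field.keyword"
fi
```

The output of `mapping_body` could be passed to an index-creation request with `-d "$(mapping_body)"`, in the same curl style as the benchmark script in this description.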

This has pretty much infinite speed gain. An example using a million
documents:

Before:  "took" : 391,
 After:  "took" :   4,

But this is going from totally un-indexed linear scans to totally
indexed. You can make the "Before" number as high as you want by loading
more data.

Reenables text == pushdown and adds support for text != pushdown.

It does so by making TranslationAware#translatable return something
we can turn into a tri-valued function. It has these values:

YES
NO
RECHECK

YES means the Expression is entirely pushable into Lucene. It will
be pushed into Lucene and removed from the plan.

NO means the Expression can't be pushed to Lucene at all and will stay
in the plan.

RECHECK means the Expression can push a query that makes candidate
matches but must be rechecked. Documents that don't match the query won't
match the expression, but documents that match the query might not match
the expression. These expressions are pushed to Lucene and left in the plan.

This is required because txt != "b" can build a candidate query
against the txt.keyword subfield but it can't be sure of the match
without loading the _source - which we do in the compute engine.
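
The three outcomes can be summarized with a small shell sketch. This only illustrates the behavior described above; it is not the Java code in the PR. The function name `plan_action` and its output format are made up for this example.

```shell
# Map a translatable value to what happens to the expression.
plan_action() {
    case "$1" in
        YES)     echo "push=yes keep_in_plan=no" ;;   # fully handled by Lucene
        NO)      echo "push=no keep_in_plan=yes" ;;   # evaluated only in the compute engine
        RECHECK) echo "push=yes keep_in_plan=yes" ;;  # Lucene finds candidates, the plan rechecks them
        *)       echo "unknown value: $1" >&2; return 1 ;;
    esac
}

for value in YES NO RECHECK; do
    printf '%-8s %s\n' "$value" "$(plan_action "$value")"
done
```

In this sketch, RECHECK is the only value that both pushes a query and keeps the expression in the plan, which matches the txt != "b" case described above.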

I haven't plugged rally into this, but here are some basic
performance tests:

Before:
not text eq {"took":460,"documents_found":1000000}
    text eq {"took":432,"documents_found":1000000}

After:
    text eq {"took":5,"documents_found":1}
not text eq {"took":351,"documents_found":800000}

This comes from:

# Build a bulk body of 1000 documents whose text cycles through "text 0".."text 4".
rm -f /tmp/bulk*
for a in {1..1000}; do
    echo '{"index":{}}' >> /tmp/bulk
    echo '{"text":"text '$(($a % 5))'"}' >> /tmp/bulk
done
ls -l /tmp/bulk*

passwd="redacted"
curl -sk -uelastic:$passwd -HContent-Type:application/json -XDELETE https://localhost:9200/test
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPUT https://localhost:9200/test -d'{
    "settings": {
        "index.codec": "best_compression",
        "index.refresh_interval": -1
    },
    "mappings": {
        "properties": {
            "many": {
                "enabled": false
            }
        }
    }
}'
# Send the same bulk body 1000 times for a total of 1,000,000 documents.
for a in {1..1000}; do
    printf %04d: $a
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_forcemerge?max_num_segments=1
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_refresh
echo
curl -sk -uelastic:$passwd https://localhost:9200/_cat/indices?v

text_eq() {
    echo -n "    text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

not_text_eq() {
    echo -n "not text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE NOT text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

for a in {1..100}; do
    text_eq
    not_text_eq
done
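
For a rough sense of scale, the ratios between the before and after "took" values reported above can be computed as follows. This is only arithmetic on those numbers; `speedup` is a hypothetical helper and it assumes `awk` is available.

```shell
# Print before/after as a ratio with one decimal place.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

echo "    text eq: $(speedup 432 5)x"
echo "not text eq: $(speedup 460 351)x"
```

That works out to roughly 86x for text eq and 1.3x for not text eq on this one-million-document index. As noted earlier, the text eq ratio grows with the amount of data loaded.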

elasticsearchmachine · 2025-05-19T18:35:25Z

Hi @nik9000, I've created a changelog YAML for you.

elasticsearchmachine · 2025-05-19T18:35:27Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

Fix #130977 Fix #130976 I'm just muting all the csv tests from #128156 in the 8.12->8.19 bwc scenario; previously, it looked like only 2 of those failed but it looks like we're generally incompatible here. Which is fine because 8.12 was pre-GA for ES|QL and is outside the support matrix, anyway.

nik9000 added 6 commits May 19, 2025 13:21

ESQL: Disable a bugged commit (elastic#127199)

40000c7

The PR elastic#126641 has a bug with `!=`.

ESQL: Make a test consistent (elastic#127234)

e0b4471

Had to sort. Relates to elastic#127199

ESQL: Fix a test bug

038f5ff

When adding support for pushing `==` to `semantic_text` we were incorrectly asserting that all queries to that field used nested documents. That's normal for `semantic_text`, but sometimes we query indices that don't have any nested fields. Closes elastic#128122

Fixup

177b698

elasticsearchmachine added the v8.19.0 and needs:triage labels May 19, 2025

nik9000 added the >feature and :Analytics/ES|QL labels and removed the needs:triage label May 19, 2025

nik9000 mentioned this pull request May 19, 2025

ESQL: Disable a bugged commit #127199

Merged

Update docs/changelog/128156.yaml

ce125ef

nik9000 mentioned this pull request May 19, 2025

ESQL: Fix a test bug #128146

Merged

elasticsearchmachine added the Team:Analytics label May 19, 2025

This was referenced May 19, 2025

ESQL: text == and text != pushdown #127355

Merged

ESQL: Push more ==s on text fields to lucene #126641

Merged

nik9000 merged commit 6fb106d into elastic:8.19 May 19, 2025
15 checks passed

nik9000 mentioned this pull request May 27, 2025

ESQL: Raise timeout on test suite #128525

Merged

dnhatn mentioned this pull request May 27, 2025

[ES|QL] WHERE Command filtering can be significantly slower than DSL #128529

Open

This was referenced Jul 9, 2025

[CI] MixedClusterEsqlSpecIT class failing #128224

Closed

[CI] MixedClusterEsqlSpecIT test {string.MvStringNotEquals SYNC} failing #130858

Open

ESQL: Fix more mv string bwc tests for string not/equals #131072

Merged
