ESQL: Push more ==s on text fields to lucene (backport) #128156

Merged
merged 7 commits into from
May 19, 2025

nik9000 (Member) commented May 19, 2025

If you do:

| WHERE text_field == "cat"

we can't push to the text field because its search index is for
individual words. But most text fields have a .keyword sub field and
we can query its index. EXCEPT! It's normal for these fields to have
ignore_above in their mapping. In that case we don't push to the
field. Very sad.

With this change we can push down ==, but only when the right-hand
side is shorter than the ignore_above.
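
The length check described above could look roughly like the sketch below. It is an illustration only, not the code in this PR: `IgnoreAboveSketch` and `canPushEquals` are hypothetical names, the strict `<` simply follows the "shorter than" wording above (the real boundary handling may differ), and treating a sub field without `ignore_above` as always pushable is an assumption.

```java
import java.util.OptionalInt;

// Hypothetical sketch; names are illustrative and not taken from the Elasticsearch code base.
final class IgnoreAboveSketch {
    /**
     * Decides whether `text_field == rhs` may be answered from the .keyword sub field.
     * ignoreAbove is empty when the sub field's mapping sets no ignore_above (assumed pushable).
     */
    static boolean canPushEquals(String rhs, OptionalInt ignoreAbove) {
        if (ignoreAbove.isEmpty()) {
            return true;
        }
        // Strict comparison, mirroring "shorter than the ignore_above" in the description.
        return rhs.length() < ignoreAbove.getAsInt();
    }
}
```

For example, `IgnoreAboveSketch.canPushEquals("cat", OptionalInt.of(256))` is true, while the same call with `OptionalInt.of(3)` is false because "cat" is not shorter than 3 characters.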

This has pretty much infinite speed gain. An example using a million
documents:

Before:  "took" : 391,
 After:  "took" :   4,

But this is going from totally un-indexed linear scans to totally
indexed. You can make the "Before" number as high as you want by loading
more data.
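
To make the scan-versus-index difference concrete, here is a toy model. It is not how Lucene stores or searches data, and every name in it (`ScanVsIndexSketch`, `countByScan`, `buildIndex`, `countByIndex`) is made up for illustration. The scan compares every document's value, so its cost grows with the number of documents; the indexed lookup only touches the entry for the wanted value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is not how Lucene stores or searches data.
final class ScanVsIndexSketch {
    /** Un-indexed evaluation: every document's value is compared. */
    static int countByScan(List<String> docs, String wanted) {
        int count = 0;
        for (String value : docs) {
            if (value.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    /** Builds a map from exact value to the ids of the documents holding it. */
    static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            index.computeIfAbsent(docs.get(id), k -> new ArrayList<>()).add(id);
        }
        return index;
    }

    /** Indexed evaluation: one lookup, regardless of how many other documents exist. */
    static int countByIndex(Map<String, List<Integer>> index, String wanted) {
        return index.getOrDefault(wanted, List.of()).size();
    }
}
```

Both paths return the same count; only the amount of work per query differs, which is why the "Before" number keeps growing as more data is loaded.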

Reenables text == pushdown and adds support for text != pushdown.

It does so by making TranslationAware#translatable return something
we can turn into a tri-valued function. It has these values:

  • YES
  • NO
  • RECHECK

YES means the Expression is entirely pushable into Lucene. Such
expressions are pushed into Lucene and removed from the plan.

NO means the Expression can't be pushed to Lucene at all and will stay
in the plan.

RECHECK means the Expression can push a query that makes candidate
matches but must be rechecked. Documents that don't match the query won't
match the expression, but documents that match the query might not match
the expression. These are pushed to Lucene and left in the plan.

This is required because txt != "b" can build a candidate query
against the txt.keyword subfield but it can't be sure of the match
without loading the _source - which we do in the compute engine.
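
The three outcomes and what happens to an expression under each can be sketched as below. This is a simplified illustration under stated assumptions, not the planner code from this PR: `TranslatableSketch`, `Filter`, `Split`, and `split` are hypothetical names, and expressions are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three outcomes; none of these types are the real Elasticsearch classes.
final class TranslatableSketch {
    enum Translatable { YES, NO, RECHECK }

    /** A filter expression paired with the answer to "can this be pushed to Lucene?". */
    record Filter(String expression, Translatable translatable) {}

    /** Result of splitting filters: what is pushed to Lucene and what stays in the plan. */
    record Split(List<String> pushed, List<String> keptInPlan) {}

    static Split split(List<Filter> filters) {
        List<String> pushed = new ArrayList<>();
        List<String> kept = new ArrayList<>();
        for (Filter f : filters) {
            switch (f.translatable()) {
                // Fully pushable: runs in Lucene and is removed from the plan.
                case YES -> pushed.add(f.expression());
                // Not pushable: stays in the plan only.
                case NO -> kept.add(f.expression());
                // Candidate query goes to Lucene, and the expression stays in the plan to recheck.
                case RECHECK -> {
                    pushed.add(f.expression());
                    kept.add(f.expression());
                }
            }
        }
        return new Split(pushed, kept);
    }
}
```

In this sketch a filter such as `txt != "b"` tagged RECHECK ends up in both lists, mirroring "pushed to Lucene and left in the plan" above.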

I haven't plugged rally into this, but here are some basic
performance tests:

Before:
not text eq {"took":460,"documents_found":1000000}
    text eq {"took":432,"documents_found":1000000}

After:
    text eq {"took":5,"documents_found":1}
not text eq {"took":351,"documents_found":800000}

This comes from:

# Build a bulk file of 1000 documents whose "text" cycles through "text 0" .. "text 4".
rm -f /tmp/bulk*
for a in {1..1000}; do
    echo '{"index":{}}' >> /tmp/bulk
    echo '{"text":"text '$((a % 5))'"}' >> /tmp/bulk
done
ls -l /tmp/bulk*

passwd="redacted"
curl -sk -uelastic:$passwd -HContent-Type:application/json -XDELETE https://localhost:9200/test
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPUT https://localhost:9200/test -d'{
    "settings": {
        "index.codec": "best_compression",
        "index.refresh_interval": -1
    },
    "mappings": {
        "properties": {
            "many": {
                "enabled": false
            }
        }
    }
}'
# Load the same bulk file 1000 times, for one million documents in total.
for a in {1..1000}; do
    printf %04d: $a
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/test/_bulk?pretty' --data-binary @/tmp/bulk | grep errors
done
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_forcemerge?max_num_segments=1
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_refresh
echo
curl -sk -uelastic:$passwd https://localhost:9200/_cat/indices?v

text_eq() {
    echo -n "    text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

not_text_eq() {
    echo -n "not text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE NOT text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

for a in {1..100}; do
    text_eq
    not_text_eq
done

nik9000 added 6 commits May 19, 2025 13:21
When adding support for pushing `==` to `semantic_text` we were
incorrectly asserting that all queries to that field used nested
documents. That's normal for `semantic_text`, but sometimes we query
indices that don't have any nested fields.

Closes elastic#128122
@elasticsearchmachine added the v8.19.0 and needs:triage labels May 19, 2025
@nik9000 added the >feature and :Analytics/ES|QL labels and removed the needs:triage label May 19, 2025
@elasticsearchmachine (Collaborator)

Hi @nik9000, I've created a changelog YAML for you.

@nik9000 mentioned this pull request May 19, 2025
@elasticsearchmachine added the Team:Analytics label May 19, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000 nik9000 merged commit 6fb106d into elastic:8.19 May 19, 2025
15 checks passed
elasticsearchmachine pushed a commit that referenced this pull request Jul 11, 2025
Fix #130977
Fix #130976

I'm just muting all the CSV tests from
#128156 in the 8.12->8.19
BWC scenario. Previously it looked like only two of those failed, but it
looks like we're generally incompatible here. That's fine because 8.12
was pre-GA for ES|QL and is outside the support matrix anyway.
Labels
:Analytics/ES|QL, >feature, Team:Analytics, v8.19.0