ESQL: text == and text != pushdown #127355

Merged
merged 24 commits into from
May 8, 2025

nik9000
Member

@nik9000 nik9000 commented Apr 24, 2025

Reenables text == pushdown and adds support for text != pushdown.

It does so by making TranslationAware#translatable return something we can turn into a tri-valued function. It has these values:

  • YES
  • NO
  • RECHECK

YES means the Expression is entirely pushable into Lucene. Such expressions are pushed into Lucene and removed from the plan.

NO means the Expression can't be pushed to Lucene at all and will stay in the plan.

RECHECK means the Expression can push a query that makes candidate matches but must be rechecked. Documents that don't match the query won't match the expression, but documents that match the query might not match the expression.
These are pushed to Lucene and left in the plan.
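
To make the three outcomes concrete, here is a minimal sketch of how a planner could act on them. The names `PushdownSketch`, `Pushable`, `Placement` and `place` are invented for illustration and are not the types this PR adds; the sketch only encodes the "pushed to Lucene" and "left in the plan" rules stated above.

```java
// Hypothetical names throughout; this is not the Elasticsearch API.
class PushdownSketch {
    /** The three outcomes described in the PR text. */
    enum Pushable { YES, NO, RECHECK }

    /** Where a filter expression ends up after planning. */
    record Placement(boolean pushedToLucene, boolean keptInPlan) {}

    static Placement place(Pushable p) {
        return switch (p) {
            // Fully handled by Lucene, so the plan no longer needs it.
            case YES -> new Placement(true, false);
            // Lucene can't help, so the compute engine evaluates it.
            case NO -> new Placement(false, true);
            // Lucene narrows the candidates and the plan rechecks them.
            case RECHECK -> new Placement(true, true);
        };
    }

    public static void main(String[] args) {
        for (Pushable p : Pushable.values()) {
            System.out.println(p + " -> " + place(p));
        }
    }
}
```

RECHECK is the only outcome where both flags are set, which is why it is the only one that costs both a Lucene query and an evaluation in the compute engine.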

This is required because txt != "b" can build a candidate query against the txt.keyword subfield but it can't be sure of the match without loading the _source - which we do in the compute engine.
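
The candidate-then-recheck idea can be sketched in plain Java. Everything here is an assumption made for illustration: `Doc`, `candidate`, `exact` and `notEquals` are invented names, `indexedKeyword` stands in for whatever the keyword subfield can answer cheaply, and `source` stands in for the value loaded later. The document with no value is just one way a candidate might fail the recheck; the PR text does not enumerate the real cases.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of "push a candidate query, then recheck"; not Elasticsearch code.
class RecheckSketch {
    record Doc(int id, String indexedKeyword, String source) {}

    // Candidate query: rejects only documents the index is sure equal the value.
    static boolean candidate(Doc d, String value) {
        return !value.equals(d.indexedKeyword());
    }

    // Exact check on the loaded value; in this model a missing value never satisfies !=.
    static boolean exact(Doc d, String value) {
        return d.source() != null && !d.source().equals(value);
    }

    // Evaluate "field != value": the cheap filter runs first, the exact filter second.
    static List<Integer> notEquals(List<Doc> docs, String value) {
        return docs.stream()
            .filter(d -> candidate(d, value))
            .filter(d -> exact(d, value))
            .map(Doc::id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(1, "a", "a"),   // candidate, and passes the recheck
            new Doc(2, "b", "b"),   // rejected by the candidate query alone
            new Doc(3, null, null)  // candidate, but fails the recheck
        );
        System.out.println(notEquals(docs, "b")); // prints [1]
    }
}
```

The candidate filter never drops a document the exact filter would keep, which is the property that makes it safe to run first.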

I haven't plugged rally into this, but here are some basic performance tests:

Before:
not text eq {"took":460,"documents_found":1000000}
    text eq {"took":432,"documents_found":1000000}

After:
    text eq {"took":5,"documents_found":1}
not text eq {"took":351,"documents_found":800000}    

This comes from:

rm -f /tmp/bulk*
for a in {1..1000}; do
    echo '{"index":{}}' >> /tmp/bulk
    echo '{"text":"text '$(printf $(($a % 5)))'"}' >> /tmp/bulk
done
ls -l /tmp/bulk*

passwd="redacted"
curl -sk -uelastic:$passwd -HContent-Type:application/json -XDELETE https://localhost:9200/test
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPUT https://localhost:9200/test -d'{
    "settings": {
        "index.codec": "best_compression",
        "index.refresh_interval": -1
    },
    "mappings": {
        "properties": {
            "many": {
                "enabled": false
            }
        }
    }
}'
for a in {1..1000}; do
    printf %04d: $a
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_forcemerge?max_num_segments=1
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_refresh
echo
curl -sk -uelastic:$passwd https://localhost:9200/_cat/indices?v

text_eq() {
    echo -n "    text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

not_text_eq() {
    echo -n "not text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE NOT text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}


for a in {1..100}; do
    text_eq
    not_text_eq
done
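
As a sanity check on the data set, the loops in the script imply the document counts below. This is a small sketch with invented names (`DatasetCounts`, `total`, `matching`); it only replays the script's arithmetic (1000 lines per bulk file, 1000 bulk requests, values `text 0` through `text 4`) and does not talk to Elasticsearch.

```java
// Replays the arithmetic of the benchmark script; hypothetical helper names.
class DatasetCounts {
    // Total documents: lines per bulk file times the number of bulk requests.
    static long total(int docsPerBulk, int bulkRequests) {
        return (long) docsPerBulk * bulkRequests;
    }

    // Documents whose value is "text <residue>", given values are "text <a % modulus>".
    static long matching(int docsPerBulk, int bulkRequests, int modulus, int residue) {
        long perBulk = 0;
        for (int a = 1; a <= docsPerBulk; a++) {
            if (a % modulus == residue) {
                perBulk++;
            }
        }
        return perBulk * bulkRequests;
    }

    public static void main(String[] args) {
        long all = total(1000, 1000);
        long eq = matching(1000, 1000, 5, 1);
        System.out.println("total       = " + all);        // 1000000
        System.out.println("== 'text 1' = " + eq);         // 200000
        System.out.println("!= 'text 1' = " + (all - eq)); // 800000
    }
}
```

The 800000 figure lines up with the `documents_found` reported for the `NOT text == "text 1"` query in the After run.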

nik9000 added 2 commits April 24, 2025 09:27
Reenables `text ==` pushdown and adds support for `text !=` pushdown.
@elasticsearchmachine
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

Member Author

@nik9000 nik9000 left a comment

This seems to work but wants more docs and an explanation.

@nik9000
Member Author

nik9000 commented Apr 25, 2025

This also wants more digging to be sure it's right. This is the first PR that pushes things to Lucene and rechecks them. The change for that was surprisingly small and I don't trust it to be that easy.

@nik9000 nik9000 marked this pull request as ready for review May 5, 2025 21:25
@nik9000
Member Author

nik9000 commented May 5, 2025

OK! This is worth a look now!

@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics label May 5, 2025
@nik9000
Member Author

nik9000 commented May 6, 2025

I'm going to make some performance numbers for this today.

Member

@dnhatn dnhatn left a comment

I left one comment, but this looks great - especially the docs and samples. Thank you!

if (ft == null) {
    return new MatchNoDocsQuery("missing field [" + field() + "]");
}
ft = ((TextFieldMapper.TextFieldType) ft).syntheticSourceDelegate();
Member

I think SemanticTextFieldType and MatchOnlyTextFieldType are text fields, but their base classes are not TextFieldMapper.TextFieldType.

Member Author

👍. Will look.

Member Author

Update on this one: we don't push either of those fields. I've expanded the tests to hit that as well. We can pick them up in follow-up changes.

Member

👍

Contributor

@luigidellaquila luigidellaquila left a comment

Neat! LGTM
The docs are just fantastic.

import org.elasticsearch.xpack.esql.core.expression.Expression;
import org.elasticsearch.xpack.esql.core.querydsl.query.Query;
import org.elasticsearch.xpack.esql.optimizer.rules.physical.local.LucenePushdownPredicates;
import org.elasticsearch.xpack.esql.planner.TranslatorHandler;

/**
* Expressions implementing this interface can get called on data nodes to provide an Elasticsearch/Lucene query.
* Expressions implementing this interface are asked provide an
* Elasticsearch/Lucene query on the as part of the data node optimizations.
Contributor

"on the" is probably a leftover

*/
YES(FinishedTranslatable.YES),
/**
* Translation requires a recheck. Calling {@link TranslationAware#asQuery} will
Contributor

❤️

@nik9000 nik9000 enabled auto-merge (squash) May 8, 2025 12:56
@nik9000 nik9000 merged commit 3551494 into elastic:main May 8, 2025
16 of 17 checks passed
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 19, 2025
@nik9000
Member Author

nik9000 commented May 19, 2025

Backported by #128156

Labels: :Analytics/ES|QL (AKA ESQL), >enhancement, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v8.19.0, v9.1.0
4 participants