Description
Elasticsearch Version
main (198148c)
Installed Plugins
n/a
Java Version
bundled
OS Version
6.8.0-64-generic #67~22.04.1-Ubuntu
Problem Description
When a wildcard field contains Unicode code points that are encoded as a high and low surrogate pair (for example, an emoji), some queries with leading wildcards fail to return results.
I found this while running a term query against a wildcard field with a query value containing an emoji. It caused this assertion to fail, because tokenSize is greater than 3 when an emoji is present. Despite the assertion failure, term queries appear to work correctly. Any fix for wildcard queries should remove or fix this assertion as well.
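As a rough illustration of why an emoji can push a token past a length of 3, the sketch below compares the number of code points in a string with the number of UTF-16 code units (what Java's String.length() reports). The helper names are hypothetical, and this does not show how the wildcard field actually computes tokenSize; it only shows that the two ways of counting disagree once a surrogate pair is involved.

```python
# Illustrative sketch only: helper names are hypothetical and this is not
# Elasticsearch code. It compares two ways of measuring string length.

def code_points(s: str) -> int:
    """Number of Unicode code points (Python strings are code point sequences)."""
    return len(s)


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units, i.e. what Java's String.length() returns."""
    return len(s.encode("utf-16-le")) // 2


if __name__ == "__main__":
    emoji = "\U0001F600"  # the grinning face emoji used in the reproduction
    for text in (emoji + "a", emoji + "ab", "abc"):
        print(repr(text), "code points:", code_points(text),
              "UTF-16 units:", utf16_units(text))
```

A string made of three code points that includes one emoji occupies four UTF-16 code units, so a check that expects at most 3 units can fail even though the token holds only three characters.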
Steps to Reproduce
Setup:
curl -X PUT "localhost:9200/emoji_logs" -H 'Content-Type: application/json' -d'
{
"mappings": {
"properties": {
"message": {
"type": "wildcard"
}
}
}
}
'
curl -X POST "localhost:9200/emoji_logs/_doc" -H 'Content-Type: application/json' -d'
{
"message": "😀a"
}
'
The following query fails to return results:
curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"wildcard": {
"message": {
"value": "*😀a*"
}
}
}
}
'
A similar query without the leading wildcard:
curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"wildcard": {
"message": {
"value": "😀a*"
}
}
}
}
'
does produce hits:
{
"_index" : "emoji_logs",
"_id" : "PW5XWZgBIshG2EdpkMI3",
"_score" : 1.0,
"_source" : {
"message" : "\uD83D\uDE00a"
}
}
Likewise, a similar query with a leading wildcard, but no emoji, does produce results:
curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"wildcard": {
"message": {
"value": "*a*"
}
}
}
}
'
Logs (if relevant)
No response