Wildcard query with emoji and leading wildcard missing results #132144

@parkertimmins

Elasticsearch Version

main (198148c)

Installed Plugins

n/a

Java Version

bundled

OS Version

6.8.0-64-generic #67~22.04.1-Ubuntu

Problem Description

When a wildcard field contains Unicode code points that are encoded as a high and low surrogate pair, for example an emoji, some queries with leading wildcards fail to return results.

I found this when trying to run a term query against a wildcard field with a query value containing an emoji. It caused this assertion to fail because the tokenSize is greater than 3 when there is an emoji. Despite the assertion failure, term queries appear to work correctly. Any fix to wildcard queries should remove or fix this assertion as well.
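For readers unfamiliar with surrogate pairs, here is a minimal Java sketch of why the emoji changes length-based counts. It is not Elasticsearch code and does not reproduce the wildcard field's tokenizer; the class name `SurrogateDemo` and its helper methods are hypothetical and exist only for illustration. It shows that the test value "😀a" is two code points but three UTF-16 `char`s, which is the kind of mismatch that could push a size counted in `char`s past an expected bound.

```java
public class SurrogateDemo {
    // Hypothetical helper: number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Hypothetical helper: number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) followed by 'a', the same value indexed in the reproduction.
        String message = "\uD83D\uDE00a";

        System.out.println("code units:  " + codeUnits(message));   // 3
        System.out.println("code points: " + codePoints(message));  // 2

        // The emoji occupies the first two chars as a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(message.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(message.charAt(1)));  // true
    }
}
```

Whether the wildcard field should count tokens in code points or in `char`s is a question for the fix; the sketch only illustrates that the two counts differ for this input.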

Steps to Reproduce

Setup:

curl -X PUT "localhost:9200/emoji_logs" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "message": {
        "type": "wildcard"
      }
    }
  }
}
'

curl -X POST "localhost:9200/emoji_logs/_doc" -H 'Content-Type: application/json' -d'
{
  "message": "😀a" 
}
'

The following query fails to return results:

curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*😀a*"
      }
    }
  }
}
'

A similar query without the leading wildcard:


curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "message": {
        "value": "😀a*"
      }
    }
  }
}
'

does produce hits:

{
  "_index" : "emoji_logs",
  "_id" : "PW5XWZgBIshG2EdpkMI3",
  "_score" : 1.0,
  "_source" : {
    "message" : "\uD83D\uDE00a"
  }
}

Likewise, a similar query with a leading wildcard, but no emoji, does produce results:


curl -X GET "localhost:9200/emoji_logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*a*"
      }
    }
  }
}
'

Logs (if relevant)

No response
