Hierarchical centroid storage for DiskBBQ #132010

Merged
merged 8 commits into elastic:main
Jul 30, 2025

@iverase iverase commented Jul 28, 2025

This commit adds a hierarchical layer on top of the DiskBBQ centroids. The idea is to run the k-means algorithm on the generated centroids to obtain a parent layer of centroids that can be used at search time to avoid scoring every single centroid in a neighbour queue. This implementation uses a fixed size of 16 centroids per parent cluster.
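
As an illustration only, the parent layer could be built along the following lines. This is a minimal Python sketch, not the Elasticsearch code (which is Java): `build_parent_layer`, its parameters, the plain Lloyd iterations and the squared Euclidean distance are all assumptions made for the example. Note that plain k-means only gives about 16 children per parent on average; how the PR enforces the fixed size is not described here.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_parent_layer(centroids, per_parent=16, iterations=10, seed=0):
    """Cluster the existing centroids into parents (hypothetical helper).

    Returns (parents, children_of), where children_of[p] lists the ids of
    the centroids assigned to parent p.
    """
    n_parents = max(1, math.ceil(len(centroids) / per_parent))
    rng = random.Random(seed)
    # Initialise parents from randomly chosen centroids.
    parents = [list(c) for c in rng.sample(centroids, n_parents)]
    assignment = [0] * len(centroids)
    for _ in range(iterations):
        # Assignment step: each centroid goes to its nearest parent.
        assignment = [
            min(range(n_parents), key=lambda p: sq_dist(c, parents[p]))
            for c in centroids
        ]
        # Update step: move each non-empty parent to the mean of its members.
        for p in range(n_parents):
            members = [c for c, a in zip(centroids, assignment) if a == p]
            if members:
                parents[p] = [sum(col) / len(members) for col in zip(*members)]
    # Group centroid ids by the parent chosen in the last assignment step.
    children_of = [[] for _ in range(n_parents)]
    for cid, p in enumerate(assignment):
        children_of[p].append(cid)
    return parents, children_of
```

With 32 centroids and the default `per_parent=16`, this produces two parents, and every centroid id appears in exactly one `children_of` list.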

At search time, we score all parents. To keep the same recall as before, this PR uses a fixed-length queue of children centroids. Its size is currently defined as a percentage of the number of centroids, which is set to 10%. We therefore process the parent centroid queue until the children queue is full.

Once the children queue is full, we start visiting the posting lists in order. Note that whenever we remove a centroid from the children queue, we add a new centroid from the parent queue, so we always have a fixed number of centroids scored.
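
The search-time traversal described above can be sketched roughly as follows. This is a hedged Python illustration rather than the actual Java implementation: `visit_order`, `children_of`, `percent` and `max_visits` are hypothetical names, squared Euclidean distance is an assumed metric, and refilling one child at a time in parent order is one reading of the description.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def visit_order(query, parents, children_of, centroids, max_visits, percent=10):
    """Return (visited centroid ids, number of vectors scored)."""
    # Fixed children-queue size: a percentage of the total centroid count.
    capacity = max(1, (len(centroids) * percent) // 100)

    # Score every parent and order them by distance to the query.
    parent_queue = [(sq_dist(query, p), pid) for pid, p in enumerate(parents)]
    heapq.heapify(parent_queue)
    scored = len(parents)

    def children_by_parent_rank():
        # Lazily walk the children of the nearest parent, then the next one.
        while parent_queue:
            _, pid = heapq.heappop(parent_queue)
            yield from children_of[pid]

    feed = children_by_parent_rank()
    child_queue = []

    def score_next():
        # Score one more child centroid, if any parent still has children left.
        nonlocal scored
        cid = next(feed, None)
        if cid is None:
            return False
        heapq.heappush(child_queue, (sq_dist(query, centroids[cid]), cid))
        scored += 1
        return True

    # Fill the children queue before visiting any posting list.
    while len(child_queue) < capacity and score_next():
        pass

    visited = []
    while child_queue and len(visited) < max_visits:
        _, cid = heapq.heappop(child_queue)
        visited.append(cid)  # the posting list of cid would be read here
        score_next()         # replace the removed child with a newly scored one
    return visited, scored
```

The point of the sketch is that `scored` grows with the number of posting lists actually visited, instead of always equalling the total number of centroids as in a flat scan.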

With this approach, we observe 75% to 80% fewer centroid operations while keeping the same recall.

@iverase iverase marked this pull request as draft July 28, 2025 12:45
@iverase iverase marked this pull request as ready for review July 30, 2025 14:19
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) Jul 30, 2025
@elasticsearchmachine

Pinging @elastic/es-search-relevance (Team:Search Relevance)


@john-wagster john-wagster left a comment

lgtm

@iverase iverase merged commit 63b77a6 into elastic:main Jul 30, 2025
33 checks passed
@iverase iverase deleted the hierarchical_centroids branch July 30, 2025 15:03
@tteofili

woah! 💯

afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
This commit presents a hierarchical layer on top of the DiskBBQ centroids to reduce the number of centroids scored
at search time.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025
This commit presents a hierarchical layer on top of the DiskBBQ centroids to reduce the number of centroids scored
at search time.
Labels
>non-issue, :Search Relevance/Vectors (Vector search), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch), v9.2.0
4 participants