
Reduce heap usage in hierarchical k-means #132391


Open
wants to merge 4 commits into main

Conversation

@iverase iverase (Contributor) commented Aug 4, 2025

I noticed we allocate many float[] arrays during hierarchical k-means, and we keep all the centroids in memory twice, which seems very wasteful as they are the biggest contributor to heap usage.

It can be improved in two places:

  1. During HierarchicalKMeans#updateAssignmentsWithRecursiveSplit we allocate a new two-dimensional array for merging the results. The inner arrays are allocated with the number of dimensions, but they are later overwritten while copying, so we should initialize the outer array with null entries and avoid the over-allocation.

  2. During KMeansLocal#stepLloyd we use a second array to update the centroids. I believe this is done for performance, but I think it is the wrong trade-off: this array can be pretty big, and we are better off updating the centroids in place, even though that means an extra loop over all the vectors (a sketch of this is below).

This commit implements the two points above.
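
For illustration, here is a minimal sketch of what the in-place update of point 2 could look like; the class, method signature, and parameter names are assumptions for this sketch, not the actual KMeansLocal#stepLloyd code. Point 1 essentially amounts to allocating only the outer array (new float[n][] instead of new float[n][dims]) and letting the copy step install the inner arrays.

    // Sketch only: in-place Lloyd step without a separate nextCentroids array.
    import java.util.Arrays;

    final class InPlaceLloydSketch {

        // centroids is updated in place; centroidCounts is scratch space of length centroids.length.
        static void stepLloyd(float[][] vectors, float[][] centroids, int[] assignments, int[] centroidCounts) {
            int dim = centroids[0].length;

            // Extra loop over the vectors: count the assignments first so we know
            // which centroids will be rebuilt.
            Arrays.fill(centroidCounts, 0);
            for (int assignment : assignments) {
                centroidCounts[assignment]++;
            }

            // Zero only the centroids that have assigned vectors, then accumulate sums in place.
            for (int c = 0; c < centroids.length; c++) {
                if (centroidCounts[c] > 0) {
                    Arrays.fill(centroids[c], 0.0f);
                }
            }
            for (int v = 0; v < vectors.length; v++) {
                float[] centroid = centroids[assignments[v]];
                for (int d = 0; d < dim; d++) {
                    centroid[d] += vectors[v][d];
                }
            }

            // Divide the sums by the counts to get the new means.
            for (int c = 0; c < centroids.length; c++) {
                if (centroidCounts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] /= centroidCounts[c];
                    }
                }
            }
        }
    }

The trade-off is the extra counting pass over the assignments in exchange for never holding a second centroids-sized array on the heap.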

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) on Aug 4, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR aims to reduce heap memory usage in hierarchical k-means clustering by eliminating unnecessary array allocations and avoiding duplicate centroids storage.

  • Replaces the temporary nextCentroids array with in-place centroid updates using an additional loop
  • Changes 2D array allocation to use null initialization instead of pre-allocating inner arrays that get overwritten
  • Adds early return for empty sub-partitions to avoid unnecessary processing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • KMeansLocal.java: removes the nextCentroids array and implements in-place centroid updates with an extra loop over the vectors
  • HierarchicalKMeans.java: changes array allocation to avoid pre-allocating the inner arrays and adds an empty-partition check
Comments suppressed due to low confidence (1)

server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/HierarchicalKMeans.java:181

  • There is an extra closing brace at the end of the file. This appears to be a formatting error that should be removed.
}

@john-wagster john-wagster (Contributor) left a comment
lgtm

float[][] nextCentroids,
int[] centroidCounts,
Member

What do you think of having a boolean[] isChanged or a FixedBitSet isChanged so that we only adjust the centroids that actually changed (basically, keeping track of changed centroids with new FixedBitSet(centroids.length))?

It seems to me that as the steps increase, fewer centroids will actually get mutated. This will still significantly reduce heap.
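
A rough sketch of the idea, assuming the affected clusters can be flagged while reassigning vectors (FixedBitSet is org.apache.lucene.util.FixedBitSet; the nearest-centroid helper below is only a placeholder for the existing assignment logic):

    // Sketch of the idea, not existing code: flag the centroids whose membership
    // changed during assignment so that only those need to be reset and recomputed.
    import org.apache.lucene.util.FixedBitSet;

    final class ChangedCentroidsSketch {

        // Reassigns vectors and returns the set of centroids whose membership changed.
        static FixedBitSet reassign(float[][] vectors, float[][] centroids, int[] assignments) {
            FixedBitSet isChanged = new FixedBitSet(centroids.length);
            for (int v = 0; v < vectors.length; v++) {
                int next = nearestCentroid(vectors[v], centroids);
                if (next != assignments[v]) {
                    isChanged.set(assignments[v]); // the old centroid loses a vector
                    isChanged.set(next);           // the new centroid gains a vector
                    assignments[v] = next;
                }
            }
            // Callers would then zero and recompute only the centroids c with isChanged.get(c) == true.
            return isChanged;
        }

        // Placeholder for the existing assignment logic: nearest centroid by squared Euclidean distance.
        private static int nearestCentroid(float[] vector, float[][] centroids) {
            int best = 0;
            float bestDistance = Float.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                float distance = 0f;
                for (int d = 0; d < vector.length; d++) {
                    float diff = vector[d] - centroids[c][d];
                    distance += diff * diff;
                }
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = c;
                }
            }
            return best;
        }
    }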

@iverase iverase (Contributor, Author) commented Aug 4, 2025

I am not sure I follow you, but we only mutate changed centroids by checking if the counts are bigger than 0, e.g.:

    for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
        if (centroidCounts[clusterIdx] > 0) {
            Arrays.fill(centroids[clusterIdx], 0.0f);
        }
    }

And we need the counts for computing the centroids:

    for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
        if (centroidCounts[clusterIdx] > 0) {
            float countF = (float) centroidCounts[clusterIdx];
            for (int d = 0; d < dim; d++) {
                centroids[clusterIdx][d] /= countF;
            }
        }
    }

In general I don't think this allocation is problematic, as later on we will allocate an array for SOAR assignments which should be much bigger than this one. That's not the case for the nextCentroids array.

Member

@iverase centroidCounts also accounts for centroids that have assigned vectors but never changed, right?

I am saying it seems like if ANY centroid changes at all, we rebuild all of them. This seems wrong to me.

Contributor Author

doh! sorry, I got you now, I will have a go tomorrow.

Contributor Author

Done, it makes sense to me, lmk what you think.

Labels
>non-issue · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · v9.2.0
4 participants