Reduce heap usage in hierarchical k-means #132391

iverase · 2025-08-04T12:00:17Z

I noticed that we allocate many float[] arrays during hierarchical k-means and that we hold all the centroids in memory twice. This seems very wasteful, as it is the biggest contributor to heap usage.

It can be improved in two places:

During HierarchicalKMeans#updateAssignmentsWithRecursiveSplit we allocate a new two-dimensional array for merging the results. Each inner array is allocated to the number of dimensions, but we later overwrite those arrays while copying. It is better to leave the inner arrays null so we don't over-allocate.
During KMeansLocal#stepLloyd we use a second array to update the centroids. I believe this is done for performance, but I think it is the wrong trade-off: this array can be pretty big, and we would be better off updating the centroids in place, even though that requires an extra loop over all the vectors.
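As an illustration of the first point, here is a minimal sketch of merging per-partition centroid arrays while allocating only the outer array. The class `LazyRows` and the method `merge` are hypothetical names for this example and are not taken from HierarchicalKMeans.

```java
class LazyRows {
    // Hypothetical helper: concatenates the rows of several partitions.
    static float[][] merge(float[][][] partitions) {
        int total = 0;
        for (float[][] partition : partitions) {
            total += partition.length;
        }
        // new float[total][] allocates only the outer array; every row starts as null.
        // new float[total][dim] would also allocate total rows of dim floats,
        // all of which would be discarded by the copy below.
        float[][] merged = new float[total][];
        int offset = 0;
        for (float[][] partition : partitions) {
            // Copies row references, so no per-row float[] is allocated here.
            System.arraycopy(partition, 0, merged, offset, partition.length);
            offset += partition.length;
        }
        return merged;
    }
}
```

The merged array shares its rows with the inputs, so this only works when the caller no longer needs to mutate the per-partition arrays independently.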

This commit implements the two points above.
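For the second point, a rough sketch of an in-place Lloyd step could look like the following. It is not the actual KMeansLocal code: the class `InPlaceLloyd`, the brute-force nearest-centroid search, and the method signatures are assumptions made for illustration. The idea it shows is that assignments are computed against the old centroids first, and the centroids are then rebuilt in the same array with one extra pass over the vectors.

```java
import java.util.Arrays;

class InPlaceLloyd {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns true if at least one vector changed its assignment.
    static boolean step(float[][] vectors, float[][] centroids, int[] assignments, int[] centroidCounts) {
        int dim = vectors[0].length;
        boolean changed = false;
        Arrays.fill(centroidCounts, 0);
        // Pass 1: assign vectors while the centroids still hold their old values.
        for (int i = 0; i < vectors.length; i++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                float dist = squaredDistance(vectors[i], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            if (assignments[i] != best) {
                assignments[i] = best;
                changed = true;
            }
            centroidCounts[best]++;
        }
        if (changed == false) {
            return false;
        }
        // Zero only centroids that received vectors; empty ones keep their old value.
        for (int c = 0; c < centroids.length; c++) {
            if (centroidCounts[c] > 0) {
                Arrays.fill(centroids[c], 0.0f);
            }
        }
        // Pass 2 (the extra loop): accumulate vectors directly into the centroids.
        for (int i = 0; i < vectors.length; i++) {
            float[] centroid = centroids[assignments[i]];
            for (int d = 0; d < dim; d++) {
                centroid[d] += vectors[i][d];
            }
        }
        for (int c = 0; c < centroids.length; c++) {
            if (centroidCounts[c] > 0) {
                float countF = (float) centroidCounts[c];
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] /= countF;
                }
            }
        }
        return true;
    }
}
```

The extra heap this needs is one int per centroid instead of a second float[][] the size of the centroid set.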

elasticsearchmachine · 2025-08-04T12:00:42Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Copilot

Pull Request Overview

This PR aims to reduce heap memory usage in hierarchical k-means clustering by eliminating unnecessary array allocations and avoiding duplicate centroid storage.

Replaces the temporary nextCentroids array with in-place centroid updates using an additional loop
Changes 2D array allocation to use null initialization instead of pre-allocating inner arrays that get overwritten
Adds early return for empty sub-partitions to avoid unnecessary processing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
KMeansLocal.java	Removes `nextCentroids` array and implements in-place centroid updates with an extra vector loop
HierarchicalKMeans.java	Changes array allocation to avoid pre-allocating inner arrays and adds empty partition check

Comments suppressed due to low confidence (1)

server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/HierarchicalKMeans.java:181

There is an extra closing brace at the end of the file. This appears to be a formatting error that should be removed.

server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/KMeansLocal.java

john-wagster

lgtm

benwtrent · 2025-08-04T16:29:41Z

server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/KMeansLocal.java

-        float[][] nextCentroids,
+        int[] centroidCounts,


What do you think of having a boolean[] isChanged or a FixedBitSet isChanged, so that we only adjust the centroids that actually changed? Basically, we keep track of changed centroids, e.g. new FixedBitSet(centroids.length).

It seems to me that as the steps increase, fewer centroids will actually get mutated. This will still significantly reduce heap.
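A minimal sketch of this suggestion, using a plain boolean[] rather than Lucene's FixedBitSet so the example stays self-contained. The class `ChangedCentroids` and both method names are hypothetical, and marking both the old and the new centroid of a moved vector is an assumption about how "changed" would be tracked, not something confirmed by the PR.

```java
import java.util.Arrays;

class ChangedCentroids {
    // Marks a centroid as changed if any vector left it or joined it.
    static boolean[] markChanged(int[] oldAssignments, int[] newAssignments, int numCentroids) {
        boolean[] isChanged = new boolean[numCentroids];
        for (int i = 0; i < newAssignments.length; i++) {
            if (oldAssignments[i] != newAssignments[i]) {
                if (oldAssignments[i] >= 0) {
                    isChanged[oldAssignments[i]] = true;
                }
                isChanged[newAssignments[i]] = true;
            }
        }
        return isChanged;
    }

    // Recomputes, in place, only the centroids flagged in isChanged.
    static void recompute(float[][] vectors, int[] assignments, float[][] centroids, boolean[] isChanged) {
        int dim = vectors[0].length;
        int[] counts = new int[centroids.length];
        for (int i = 0; i < vectors.length; i++) {
            counts[assignments[i]]++;
        }
        for (int c = 0; c < centroids.length; c++) {
            // A changed centroid that ended up empty keeps its previous value.
            if (isChanged[c] && counts[c] > 0) {
                Arrays.fill(centroids[c], 0.0f);
            }
        }
        for (int i = 0; i < vectors.length; i++) {
            int c = assignments[i];
            if (isChanged[c]) {
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] += vectors[i][d];
                }
            }
        }
        for (int c = 0; c < centroids.length; c++) {
            if (isChanged[c] && counts[c] > 0) {
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] /= (float) counts[c];
                }
            }
        }
    }
}
```

Centroids whose membership did not change are never written to, so later iterations with few moves touch only a small part of the centroid array.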

I am not sure I follow you, but we only mutate changed centroids, by checking whether the counts are greater than 0, e.g.:

    for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
        if (centroidCounts[clusterIdx] > 0) {
            Arrays.fill(centroids[clusterIdx], 0.0f);
        }
    }

And we need the counts for computing the centroids:

    for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
        if (centroidCounts[clusterIdx] > 0) {
            float countF = (float) centroidCounts[clusterIdx];
            for (int d = 0; d < dim; d++) {
                centroids[clusterIdx][d] /= countF;
            }
        }
    }

In general, I don't think this allocation is problematic, as later on we will allocate an array for SOAR assignments, which should be much bigger than this array. That's not the case for the nextCentroids array.

@iverase centroidCounts accounts for centroids that have assigned vectors but never changed?

I am saying it seems like if ANY centroid changes at all, we rebuild all of them. This seems wrong to me.

Doh! Sorry, I get you now. I will have a go tomorrow.

Done, it makes sense to me, lmk what you think.

Reduce heap usage in hierarchical k-means

bfd6692

iverase requested review from benwtrent, john-wagster and Copilot August 4, 2025 12:00

iverase added >non-issue :Search Relevance/Vectors Vector search v9.2.0 labels Aug 4, 2025

elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Aug 4, 2025

Copilot AI reviewed Aug 4, 2025

john-wagster approved these changes Aug 4, 2025

benwtrent reviewed Aug 4, 2025

iverase added 3 commits August 5, 2025 08:11

Merge branch 'main' into heap-hkmeans

9b215d0

Only recompute changed centroids

ee61553

iter

9a2c5b7
