-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Reduce heap usage in hierarchical k-means #132391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR aims to reduce heap memory usage in hierarchical k-means clustering by eliminating unnecessary array allocations and avoiding duplicate centroids storage.
- Replaces the temporary
nextCentroids
array with in-place centroid updates using an additional loop - Changes 2D array allocation to use
null
initialization instead of pre-allocating inner arrays that get overwritten - Adds early return for empty sub-partitions to avoid unnecessary processing
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
KMeansLocal.java | Removes nextCentroids array and implements in-place centroid updates with an extra vector loop |
HierarchicalKMeans.java | Changes array allocation to avoid pre-allocating inner arrays and adds empty partition check |
Comments suppressed due to low confidence (1)
server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/HierarchicalKMeans.java:181
- There is an extra closing brace at the end of the file. This appears to be a formatting error that should be removed.
}
server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/KMeansLocal.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
float[][] nextCentroids, | ||
int[] centroidCounts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of having boolean[] isChanged
or FixedBitSet isChanged
and we only adjust the centroids that are actually changed (basically, keeping track of changed centroids, so new FixedBitSet(centroids.length)
).
It seems to me that as the steps increase, fewer centroids will actually get mutated. This will still significantly reduce heap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I follow you but we are only mutating changed centroids by checking if the counts are bigger than 0, e.g:
for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
if (centroidCounts[clusterIdx] > 0) {
Arrays.fill(centroids[clusterIdx], 0.0f);
}
}
And we need the counts for computing the centroids:
for (int clusterIdx = 0; clusterIdx < centroids.length; clusterIdx++) {
if (centroidCounts[clusterIdx] > 0) {
float countF = (float) centroidCounts[clusterIdx];
for (int d = 0; d < dim; d++) {
centroids[clusterIdx][d] /= countF;
}
}
In general I don't think this allocation is problematic as later on we will allocate an array for soar assignments which should be much bigger than this array. That's not the case for the nextCentroids array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iverase centroidCounts
accounts for centroids that have assigned vectors but never changed?
I am saying it seems like if ANY centroid changes at all, we rebuild all of them. This seems wrong to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doh! sorry, I got you know, I will have a go tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, it makes sense to me, lmk what you think.
I notice we are allocating many float[] during hierarchical kmeans and we require to have all the centroids twice in memory which seems very wasteful as it is the biggest contributor to heap.
It can be improved in two places:
During HierarchicalKMeans#updateAssignmentsWithRecursiveSplit we are allocating a new two dimensional array for merging the results. The second dimension array is allocated to the number of dimensions but later we overwrite those arrays while copying. We better initialize that array to null so we don't over-allocate.
During KMeansLocal#stepLloyd we are using a second array to update the centroids. I believe this is done for performance but I think is the wrong trade off as this array can be pretty big and we would better do it in place, although it means we need an extra loop over all the vectors.
This commit implements the two points above.