Speed up hierarchical k-means by computing distances in bulk #132384

Merged
merged 1 commit into from
Aug 4, 2025

@iverase iverase commented Aug 4, 2025

This commit adds on-heap bulk distance computations. In particular, it implements the methods `ESVectorUtil#squareDistanceBulk` and `ESVectorUtil#soarDistanceBulk`, which compute four distances in one method call. Microbenchmarks show a nice speedup, for example on AVX2:

Benchmark                                   (dims)   Mode  Cnt  Score   Error   Units
SquareDistanceBenchmark.soarDistance           384  thrpt    5  4.387 ± 0.167  ops/ms
SquareDistanceBenchmark.soarDistance           782  thrpt    5  1.952 ± 0.357  ops/ms
SquareDistanceBenchmark.soarDistance          1024  thrpt    5  1.658 ± 0.817  ops/ms
SquareDistanceBenchmark.soarDistanceBulk       384  thrpt    5  6.627 ± 0.292  ops/ms
SquareDistanceBenchmark.soarDistanceBulk       782  thrpt    5  3.577 ± 0.255  ops/ms
SquareDistanceBenchmark.soarDistanceBulk      1024  thrpt    5  3.171 ± 0.360  ops/ms
SquareDistanceBenchmark.squareDistance         384  thrpt    5  5.853 ± 0.519  ops/ms
SquareDistanceBenchmark.squareDistance         782  thrpt    5  2.844 ± 0.034  ops/ms
SquareDistanceBenchmark.squareDistance        1024  thrpt    5  2.515 ± 0.104  ops/ms
SquareDistanceBenchmark.squareDistanceBulk     384  thrpt    5  8.669 ± 1.235  ops/ms
SquareDistanceBenchmark.squareDistanceBulk     782  thrpt    5  4.012 ± 0.449  ops/ms
SquareDistanceBenchmark.squareDistanceBulk    1024  thrpt    5  3.360 ± 0.454  ops/ms
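
To illustrate what "four distances in one method call" means, here is a minimal scalar sketch. The class name `BulkDistanceSketch` and the method signatures are assumptions made for illustration, not the actual `ESVectorUtil` code, and the real implementation relies on SIMD (for example AVX2) rather than a plain loop. The point is only that the query vector is read once per dimension and reused against four candidate vectors.

```java
// Hypothetical illustration; not the ESVectorUtil implementation.
final class BulkDistanceSketch {

    // Scalar reference: squared Euclidean distance between two vectors.
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Bulk variant: a single pass over the query fills four distances.
    // Each query component is loaded once and reused for all four vectors.
    static void squareDistanceBulk(float[] q, float[] v0, float[] v1, float[] v2, float[] v3, float[] out) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        for (int i = 0; i < q.length; i++) {
            float qi = q[i];
            float d0 = qi - v0[i];
            float d1 = qi - v1[i];
            float d2 = qi - v2[i];
            float d3 = qi - v3[i];
            s0 += d0 * d0;
            s1 += d1 * d1;
            s2 += d2 * d2;
            s3 += d3 * d3;
        }
        out[0] = s0;
        out[1] = s1;
        out[2] = s2;
        out[3] = s3;
    }
}
```

The four independent accumulators also give the CPU more instruction-level parallelism than four separate calls would, which is one plausible reason for the throughput gain in the table above.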

The commit also updates k-means local to use the new methods, which gives a clear speedup. For example, indexing 3 million GloVe vectors with 200 dimensions:

before:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf   3000008           39798                 81817             0

after:


index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf   3000008           29074                 56799             0

Or 1 million Cohere vectors with 1024 dimensions:

before:

index_name       index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------  ----------  --------  --------------  --------------------  ------------  
wiki1024en.docs         ivf   1000008           51073                106968             0

after:

index_name       index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------  ----------  --------  --------------  --------------------  ------------  
wiki1024en.docs         ivf   1000008           32692                 85660             0
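
As a rough sketch of how a k-means assignment step can consume a bulk distance method, the example below scans centroids in groups of four and falls back to the scalar method for the remainder. The class `NearestCentroidSketch` and all of its methods are hypothetical and written only for this illustration. The actual k-means local code in this PR may be organized differently.

```java
// Hypothetical illustration of a nearest-centroid scan using bulk distances.
final class NearestCentroidSketch {

    // Scalar squared Euclidean distance, used for the leftover centroids.
    static float squareDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Distances from q to centroids[offset .. offset + 3], written into out[0..3].
    static void squareDistanceBulk(float[] q, float[][] centroids, int offset, float[] out) {
        float[] c0 = centroids[offset];
        float[] c1 = centroids[offset + 1];
        float[] c2 = centroids[offset + 2];
        float[] c3 = centroids[offset + 3];
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        for (int i = 0; i < q.length; i++) {
            float qi = q[i];
            float d0 = qi - c0[i];
            float d1 = qi - c1[i];
            float d2 = qi - c2[i];
            float d3 = qi - c3[i];
            s0 += d0 * d0;
            s1 += d1 * d1;
            s2 += d2 * d2;
            s3 += d3 * d3;
        }
        out[0] = s0;
        out[1] = s1;
        out[2] = s2;
        out[3] = s3;
    }

    // Returns the index of the centroid closest to the vector, or -1 if there are none.
    static int nearestCentroid(float[] vector, float[][] centroids) {
        float[] dist = new float[4];
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        int i = 0;
        // Full groups of four go through the bulk method.
        for (; i + 4 <= centroids.length; i += 4) {
            squareDistanceBulk(vector, centroids, i, dist);
            for (int j = 0; j < 4; j++) {
                if (dist[j] < bestDist) {
                    bestDist = dist[j];
                    best = i + j;
                }
            }
        }
        // Tail: fewer than four centroids left, so use the scalar method.
        for (; i < centroids.length; i++) {
            float d = squareDistance(vector, centroids[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

Because the assignment step evaluates every vector against many centroids, any per-call saving in the distance kernel is multiplied across the whole dataset, which is consistent with the lower index and force-merge times reported above.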

@elasticsearchmachine added the Team:Search Relevance label Aug 4, 2025
@elasticsearchmachine

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@tteofili tteofili left a comment

LGTM

@john-wagster john-wagster left a comment

lgtm

@iverase iverase merged commit 6578b9e into elastic:main Aug 4, 2025
34 checks passed
@iverase iverase deleted the distanceBulk branch August 4, 2025 11:32
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 5, 2025
…cking

* upstream/main: (26 commits)
  [Fleet] add privileges to `kibana_system` to read integrations data (elastic#132400)
  Add `TestEntitlementsRule` with support for dynamic entitled node paths for testing (elastic#132077)
  Reduce logging frequency for GCS per project clients (elastic#132429)
  Skip update/100_synthetic_source tests in yamlRestCompatTests (elastic#132296)
  Correct exception for missing nested path (elastic#132408)
  Fixing esql release tests elastic#132369 (elastic#132406)
  Adjust date docvalue formatting to return 4xx instead of 5xx (elastic#132414)
  Handle nested fields with the termvectors REST API in artificial docs (elastic#92568)
  Only collect bulk scored vectors when exceeding min competitive (elastic#132293)
  Fix release tests diskbbq update (elastic#132405)
  ESQL: Fix skipping of generative tests (elastic#132390)
  Short circuit failure handling in OIDC flow (elastic#130618)
  Small optimization in OptimizedScalarQuantizer by using mul instead of div (elastic#132397)
  Aggs: Add validation to Bucket script pipeline agg (elastic#132320)
  ESQL: Multiple parameters in ungrouped aggs (elastic#132375)
  ESQL: Explain test operators (elastic#132374)
  EQL: Deal with internally created IN in a different way for EQL (elastic#132167)
  Speed up hierarchical k-means by computing distances in bulk (elastic#132384)
  Reduce the number of fields per document (elastic#132322)
  Assert current thread in ESQL (elastic#132324)
  ...
Labels
>non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.2.0
4 participants