Speed up bit compared with floats or bytes script operations #117199

benwtrent · 2024-11-20T21:07:08Z

Instead of using an "if" statement, which doesn't lend itself to vectorization, I switched to expanding the bits and multiplying by the resulting 1s and 0s.

This led to a marginal speed improvement on ARM.

I expect that the Panama Vector API could be used here to be even faster, but I didn't want to spend any more time on this for the time being.

Benchmark                                              (dims)   Mode  Cnt  Score   Error   Units
IpBitVectorScorerBenchmark.dotProductByteIfStatement      768  thrpt    5  2.952 ± 0.026  ops/us
IpBitVectorScorerBenchmark.dotProductByteUnwrap           768  thrpt    5  4.017 ± 0.068  ops/us
IpBitVectorScorerBenchmark.dotProductFloatIfStatement     768  thrpt    5  2.987 ± 0.124  ops/us
IpBitVectorScorerBenchmark.dotProductFloatUnwrap          768  thrpt    5  4.726 ± 0.136  ops/us

Benchmark I used.
https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
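For readers without the gist at hand, here is a minimal sketch of the two approaches being compared. The class and method names are hypothetical and are not taken from the PR or the benchmark; the sketch assumes a byte query `q` with eight entries per packed byte of `d`, with the most significant bit mapping to the first dimension of each group, as in the diff below.

```java
// Hypothetical illustration only; names are not from the Elasticsearch code base.
final class BitDotProductSketch {

    // Branching version: add q[j] only when the matching bit in d is set.
    static int dotProductIfStatement(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            for (int bit = 0; bit < Byte.SIZE; bit++) {
                // Shift by (7 - bit) so the most significant bit pairs with the first dimension.
                if (((d[i] >> (7 - bit)) & 1) != 0) {
                    acc += q[i * Byte.SIZE + bit];
                }
            }
        }
        return acc;
    }

    // Branchless version: expand each bit to 0 or 1 and multiply by it.
    static int dotProductUnwrap(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            for (int bit = 0; bit < Byte.SIZE; bit++) {
                acc += q[i * Byte.SIZE + bit] * ((mask >> (7 - bit)) & 1);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] d = {(byte) 0b10100001};
        System.out.println(dotProductIfStatement(q, d));
        System.out.println(dotProductUnwrap(q, d));
    }
}
```

Both methods should return the same value for any input; only the second avoids a data-dependent branch in the inner loop.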

elasticsearchmachine · 2024-11-20T21:07:33Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2024-11-20T21:07:33Z

Hi @benwtrent, I've created a changelog YAML for you.

pmpailis

LGTM

john-wagster

LGTM

pmpailis · 2024-12-02T15:48:21Z

...c/main/java/org/elasticsearch/simdvec/internal/vectorization/DefaultESVectorUtilSupport.java

+        // now combine the two vectors, summing the byte dimensions where the bit in d is `1`
+        for (int i = 0; i < d.length; i++) {
+            byte mask = d[i];
+            acc0 += fma(q[i * Byte.SIZE + 0], (mask >> 7) & 1, acc0);


I overlooked this one initially, but shouldn't the additive component of the fma either be 0, or the result simply be assigned to acc0 (i.e. without +=)? I think we're making the addition twice for the accumulators of the last bits in lines 80-83.

@pmpailis you are correct ;) I did a bad copy paste here. Tests have found it.
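To make the double addition concrete, here is a small sketch. It assumes a hypothetical integer helper `fma(a, b, c)` that returns `a * b + c`; the helper in the PR may be defined differently, and the class and method names here are illustrative only.

```java
// Hypothetical illustration of the bug pattern discussed above.
final class FmaAccumulatorSketch {

    // Assumed helper: multiply-add on ints, returning a * b + c.
    static int fma(int a, int b, int c) {
        return a * b + c;
    }

    // Buggy pattern: the accumulator is passed to fma AND added again by "+=".
    static int buggy(byte[] q, byte mask) {
        int acc = 0;
        for (int bit = 0; bit < Byte.SIZE; bit++) {
            acc += fma(q[bit], (mask >> (7 - bit)) & 1, acc);
        }
        return acc;
    }

    // Fixed pattern: fma already carries the running sum, so plain assignment is enough.
    static int fixed(byte[] q, byte mask) {
        int acc = 0;
        for (int bit = 0; bit < Byte.SIZE; bit++) {
            acc = fma(q[bit], (mask >> (7 - bit)) & 1, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] q = {1, 2, 3, 4, 5, 6, 7, 8};
        byte mask = (byte) 0b11000000;
        System.out.println("buggy = " + buggy(q, mask));
        System.out.println("fixed = " + fixed(q, mask));
    }
}
```

Note that in the buggy form even a zero bit doubles the accumulator, because `fma(x, 0, acc)` returns `acc` and `+=` adds it on top of itself.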


svilen-mihaylov-db · 2024-12-02T19:07:08Z

...c/main/java/org/elasticsearch/simdvec/internal/vectorization/DefaultESVectorUtilSupport.java

+        int acc2 = 0;
+        int acc3 = 0;
+        // now combine the two vectors, summing the byte dimensions where the bit in d is `1`
+        for (int i = 0; i < d.length; i++) {


Just a drive-by question here (free to disregard): is this intended to allow vectorization?

@svilen-mihaylov-db it allows some vectorization via the unrolling, but it definitely isn't as fast as a custom vectorized version that we could provide with the Panama API. This solution isn't as fast as it could be, for sure.

Mainly, I discovered it's much faster than the previous if block, so it's a step in the right direction :)

Thanks for explaining!
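As a rough sketch of the unrolling mentioned in this thread, the loop below processes the eight bits of each packed byte by hand and spreads the work over four independent accumulators, mirroring the `acc0`..`acc3` variables in the diff. The class and method names are hypothetical, the float query is an assumption for illustration, and whether the JIT actually vectorizes such a loop depends on the JVM and hardware.

```java
// Hypothetical illustration; not the code merged in the PR.
final class UnrolledBitFloatDotSketch {

    static float dotProductFloatUnrolled(float[] q, byte[] d) {
        if (q.length != d.length * Byte.SIZE) {
            throw new IllegalArgumentException("q must hold 8 values per byte of d");
        }
        // Independent accumulators shorten the dependency chain between iterations.
        float acc0 = 0f;
        float acc1 = 0f;
        float acc2 = 0f;
        float acc3 = 0f;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            int base = i * Byte.SIZE;
            // Each bit is expanded to 0 or 1 and multiplied in; no branch is taken.
            acc0 = Math.fma(q[base + 0], (mask >> 7) & 1, acc0);
            acc1 = Math.fma(q[base + 1], (mask >> 6) & 1, acc1);
            acc2 = Math.fma(q[base + 2], (mask >> 5) & 1, acc2);
            acc3 = Math.fma(q[base + 3], (mask >> 4) & 1, acc3);
            acc0 = Math.fma(q[base + 4], (mask >> 3) & 1, acc0);
            acc1 = Math.fma(q[base + 5], (mask >> 2) & 1, acc1);
            acc2 = Math.fma(q[base + 6], (mask >> 1) & 1, acc2);
            acc3 = Math.fma(q[base + 7], mask & 1, acc3);
        }
        return acc0 + acc1 + acc2 + acc3;
    }

    public static void main(String[] args) {
        float[] q = {1.5f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        byte[] d = {(byte) 0b10010001};
        System.out.println(dotProductFloatUnrolled(q, d));
    }
}
```

Each accumulator is assigned the result of `Math.fma` rather than updated with `+=`, which avoids the double addition raised in the earlier review comment.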

Speed up bit float/byte operations slightly

782c03b

benwtrent added the >enhancement, auto-backport, :Search Relevance/Vectors, v9.0.0, and v8.17.0 labels Nov 20, 2024

elasticsearchmachine added the Team:Search Relevance label Nov 20, 2024

Update docs/changelog/117199.yaml

d98754d

elasticsearchmachine added v8.18.0 and removed v8.17.0 labels Nov 20, 2024

benwtrent and others added 3 commits November 21, 2024 13:23

Merge branch 'main' into feature/bit-float-byte-ip-speedup

4ed7c94

Merge remote-tracking branch 'upstream/main' into feature/bit-float-byte-ip-speedup

73d01b7

fix impl

c39d363

pmpailis approved these changes Dec 2, 2024

john-wagster approved these changes Dec 2, 2024

pmpailis reviewed Dec 2, 2024

fixing fma

0cea632

benwtrent added the auto-merge-without-approval label Dec 2, 2024

elasticsearchmachine merged commit e10fc3c into elastic:main Dec 2, 2024
16 checks passed

benwtrent deleted the feature/bit-float-byte-ip-speedup branch December 2, 2024 17:19

benwtrent mentioned this pull request Dec 2, 2024

[8.x] Speed up bit compared with floats or bytes script operations (#117199) #117841

Merged

svilen-mihaylov-db reviewed Dec 2, 2024
