-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Speed up bit compared with floats or bytes script operations #117199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up bit compared with floats or bytes script operations #117199
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Hi @benwtrent, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// now combine the two vectors, summing the byte dimensions where the bit in d is `1` | ||
for (int i = 0; i < d.length; i++) { | ||
byte mask = d[i]; | ||
acc0 += fma(q[i * Byte.SIZE + 0], (mask >> 7) & 1, acc0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overlooked this one initially; but shouldn't the additive component to fma
be either 0 or just reset the value of acc0
(i.e. without +=
) ? I think that we're making the addition twice in for the accumulators for last bits in lines 80-83.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmpailis you are correct ;) I did a bad copy paste here. Tests have found it.
…#117199) Instead of doing an "if" statement, which doesn't lend itself to vectorization, I switched to expand to the bits and multiply the 1s and 0s. This led to a marginal speed improvement on ARM. I expect that Panama vector could be used here to be even faster, but I didn't want to spend anymore time on this for the time being. ``` Benchmark (dims) Mode Cnt Score Error Units IpBitVectorScorerBenchmark.dotProductByteIfStatement 768 thrpt 5 2.952 ± 0.026 ops/us IpBitVectorScorerBenchmark.dotProductByteUnwrap 768 thrpt 5 4.017 ± 0.068 ops/us IpBitVectorScorerBenchmark.dotProductFloatIfStatement 768 thrpt 5 2.987 ± 0.124 ops/us IpBitVectorScorerBenchmark.dotProductFloatUnwrap 768 thrpt 5 4.726 ± 0.136 ops/us ``` Benchmark I used. https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
#117841) Instead of doing an "if" statement, which doesn't lend itself to vectorization, I switched to expand to the bits and multiply the 1s and 0s. This led to a marginal speed improvement on ARM. I expect that Panama vector could be used here to be even faster, but I didn't want to spend anymore time on this for the time being. ``` Benchmark (dims) Mode Cnt Score Error Units IpBitVectorScorerBenchmark.dotProductByteIfStatement 768 thrpt 5 2.952 ± 0.026 ops/us IpBitVectorScorerBenchmark.dotProductByteUnwrap 768 thrpt 5 4.017 ± 0.068 ops/us IpBitVectorScorerBenchmark.dotProductFloatIfStatement 768 thrpt 5 2.987 ± 0.124 ops/us IpBitVectorScorerBenchmark.dotProductFloatUnwrap 768 thrpt 5 4.726 ± 0.136 ops/us ``` Benchmark I used. https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
int acc2 = 0; | ||
int acc3 = 0; | ||
// now combine the two vectors, summing the byte dimensions where the bit in d is `1` | ||
for (int i = 0; i < d.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a drive-by question here (free to disregard): is this intended to allow vectorization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svilen-mihaylov-db it allows some vectorization via the unrolling, but it definitely isn't as fast as a custom vectorized version that we could provide with the Panama API. This solution isn't as fast as it could be, for sure.
Mainly, I discovered its much faster than the previous if
block and so its a step in the right direction :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining!
Instead of doing an "if" statement, which doesn't lend itself to vectorization, I switched to expand to the bits and multiply the 1s and 0s.
This led to a marginal speed improvement on ARM.
I expect that Panama vector could be used here to be even faster, but I didn't want to spend anymore time on this for the time being.
Benchmark I used.
https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9