Speed up bit compared with floats or bytes script operations #117199


Merged

Conversation

benwtrent
Member

Instead of using an "if" statement, which doesn't lend itself to vectorization, I switched to expanding the bits and multiplying by the 1s and 0s.

This led to a marginal speed improvement on ARM.

I expect that the Panama vector API could be used here to go even faster, but I didn't want to spend any more time on this for the time being.

```
Benchmark                                              (dims)   Mode  Cnt  Score   Error   Units
IpBitVectorScorerBenchmark.dotProductByteIfStatement      768  thrpt    5  2.952 ± 0.026  ops/us
IpBitVectorScorerBenchmark.dotProductByteUnwrap           768  thrpt    5  4.017 ± 0.068  ops/us
IpBitVectorScorerBenchmark.dotProductFloatIfStatement     768  thrpt    5  2.987 ± 0.124  ops/us
IpBitVectorScorerBenchmark.dotProductFloatUnwrap          768  thrpt    5  4.726 ± 0.136  ops/us
```

Benchmark I used.
https://gist.github.com/benwtrent/b0edb3975d2f03356c1a5ea84c72abc9
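The "unwrap" trick described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual Elasticsearch implementation; the class and method names are invented. Each bit of the packed query vector `d` is expanded to 0 or 1 and multiplied against the corresponding byte dimension of `q`, replacing the branch with straight-line arithmetic the JIT can vectorize more readily.

```java
public class BitUnwrapDotProduct {

    /**
     * Dot product between a byte vector q and a bit-packed vector d.
     * q has Byte.SIZE entries for every byte of d.
     */
    static int dotProduct(byte[] q, byte[] d) {
        int acc = 0;
        for (int i = 0; i < d.length; i++) {
            byte mask = d[i];
            // expand each of the 8 bits to 0 or 1 and multiply-accumulate,
            // instead of branching with `if ((mask >> bit & 1) != 0)`
            for (int j = 0; j < Byte.SIZE; j++) {
                acc += q[i * Byte.SIZE + j] * ((mask >> (7 - j)) & 1);
            }
        }
        return acc;
    }
}
```

The real code unrolls the inner loop into several independent accumulators rather than using a nested loop, but the arithmetic is the same.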

@benwtrent benwtrent added >enhancement auto-backport Automatically create backport pull requests when merged :Search Relevance/Vectors Vector search v9.0.0 v8.17.0 labels Nov 20, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Nov 20, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

Contributor

@pmpailis left a comment

LGTM

Contributor

@john-wagster left a comment


LGTM

```java
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
    byte mask = d[i];
    acc0 += fma(q[i * Byte.SIZE + 0], (mask >> 7) & 1, acc0);
```
Contributor


Overlooked this one initially, but shouldn't the additive component of fma be 0, or should we just assign to acc0 (i.e. without +=)? I think we're making the addition twice for the accumulators for the last bits in lines 80-83.

Member Author


@pmpailis you are correct ;) I did a bad copy-paste here. The tests caught it.
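The bug discussed above is easy to demonstrate in isolation. Since `fma(a, b, c)` computes `a * b + c`, writing `acc += fma(v, bit, acc)` carries the running sum into the product *and* adds it again via `+=`, double-counting the accumulator. The sketch below is illustrative: the `fma` helper and method names are invented for this demo, not taken from the Elasticsearch source.

```java
public class FmaBugDemo {

    // hypothetical integer fma helper: a * b + c
    static int fma(int a, int b, int c) {
        return a * b + c;
    }

    // buggy: acc is added twice (once inside fma, once by +=)
    static int accumulateBuggy(int[] q, int bit) {
        int acc = 0;
        for (int v : q) acc += fma(v, bit, acc);
        return acc;
    }

    // fixed: plain assignment; fma already carries the running sum
    static int accumulateFixed(int[] q, int bit) {
        int acc = 0;
        for (int v : q) acc = fma(v, bit, acc);
        return acc;
    }
}
```

With `q = {1, 2, 3}` and `bit = 1`, the fixed version returns 6 while the buggy one returns 11, which is why tests comparing against a scalar reference implementation caught the mistake.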

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Dec 2, 2024
@elasticsearchmachine elasticsearchmachine merged commit e10fc3c into elastic:main Dec 2, 2024
16 checks passed
@benwtrent benwtrent deleted the feature/bit-float-byte-ip-speedup branch December 2, 2024 17:19
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 2, 2024
…#117199)
elasticsearchmachine pushed a commit that referenced this pull request Dec 2, 2024
#117841)
```java
int acc2 = 0;
int acc3 = 0;
// now combine the two vectors, summing the byte dimensions where the bit in d is `1`
for (int i = 0; i < d.length; i++) {
```


Just a drive-by question here (feel free to disregard): is this intended to allow vectorization?

Member Author


@svilen-mihaylov-db it allows some vectorization via the unrolling, but it definitely isn't as fast as a custom vectorized version that we could provide with the Panama API. This solution isn't as fast as it could be, for sure.

Mainly, I discovered it's much faster than the previous if block, so it's a step in the right direction :)
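The unrolling mentioned above can be sketched as follows. Keeping several independent accumulators (as in the `acc0`..`acc3` variables quoted earlier) breaks the loop-carried dependency on a single sum, giving the JIT more freedom to auto-vectorize and pipeline the additions. This is a generic illustration of the technique, not the Elasticsearch code; the method name is invented.

```java
public class UnrolledSum {

    /** Sum with four independent accumulators plus a scalar tail loop. */
    static int sumUnrolled(int[] values) {
        int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        int i = 0;
        // main loop: four independent dependency chains per iteration
        for (; i + 4 <= values.length; i += 4) {
            acc0 += values[i];
            acc1 += values[i + 1];
            acc2 += values[i + 2];
            acc3 += values[i + 3];
        }
        int sum = acc0 + acc1 + acc2 + acc3;
        // tail: handle lengths not divisible by 4
        for (; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }
}
```

A hand-written Panama (`jdk.incubator.vector`) kernel would go further by loading whole lanes explicitly, which is the faster follow-up the author alludes to.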


Thanks for explaining!

Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.18.0 v9.0.0