[ML] Add DeBERTa-V2/V3 tokenizer #111852

Merged 25 commits into main from addDebertaTokenizer on Oct 3, 2024

Conversation

@maxhniebergall (Contributor) commented Aug 13, 2024

This PR adds support for the DeBERTa-V2/V3 tokenizer (DeBERTa-V2 and V3 use the same tokenizer). It also adds the balanced truncation method.

This PR contains:

  • the new tokenizer classes and the new named writeables
  • the special tokens in DeBERTa differ slightly, but otherwise the code is almost identical to the XLM-RoBERTa tokenizer
  • the balanced truncation option, which is now available for all models that take two inputs (see the sketch below)
  • byte fallback (decomposeBytePieces) for unknown tokens
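
For readers unfamiliar with the option, balanced truncation splits the remaining token budget across both inputs instead of truncating only one of them. Below is a minimal, hypothetical sketch of the idea; the class and method names are made up for illustration and this is not the code added in this PR.

public class BalancedTruncationSketch {

    // Returns how many tokens to keep from each of the two inputs.
    static int[] balancedLengths(int len1, int len2, int maxSequenceLength, int numSpecialTokens) {
        int budget = maxSequenceLength - numSpecialTokens;  // room left once special tokens are counted
        if (len1 + len2 <= budget) {
            return new int[] { len1, len2 };                // both inputs fit, nothing to truncate
        }
        int half = budget / 2;
        if (len1 <= half) {
            return new int[] { len1, budget - len1 };       // input 1 is short, give its spare room to input 2
        }
        if (len2 <= half) {
            return new int[] { budget - len2, len2 };       // input 2 is short, give its spare room to input 1
        }
        return new int[] { budget - half, half };           // both are long, split the budget roughly in half
    }

    public static void main(String[] args) {
        int[] kept = balancedLengths(40, 300, 128, 3);      // e.g. a 40-token query and a 300-token passage
        System.out.println(kept[0] + " + " + kept[1]);      // prints "40 + 85"
    }
}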

Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but it was too large to include in CI.

@elasticsearchmachine (Collaborator) commented:

Hi @maxhniebergall, I've created a changelog YAML for you.

tokenIds.add(IntStream.of(sepTokenId));
tokenMap.add(IntStream.of(SPECIAL_TOKEN_POSITION));
}
seqPairOffset = withSpecialTokens ? tokenId1s.size() + 2 : tokenId1s.size();
@maxhniebergall (Contributor, Author) commented:

The final special token isn't included in the offset.

@maxhniebergall marked this pull request as ready for review September 5, 2024 18:27
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label Sep 5, 2024
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

@maxhniebergall (Contributor, Author) commented:

@elasticmachine merge upstream

@maxhniebergall (Contributor, Author) commented Sep 13, 2024

There's an issue with the tokenizer involving the special character \xad (a hidden "soft hyphen").

In our system, that character is tokenized as [1, 336, 2, 507, 3, 2], but Python gives [1, 336, 2, 507, 198, 177, 2]. Token 507 is the whitespace character, so I guess SentencePiece is interpreting the soft hyphen as a whitespace character, followed by an unknown token in ES and by \xad in Python. In the vocabulary, 198 is <0xC2> and 177 is <0xAD>, which correspond to the UTF-8 encoding of the soft hyphen. So in Python the character is correctly broken down into these two bytes, but in our system it is interpreted as one unknown character...

I believe I figured out why this is happening:

BYTE_FALLBACK is false by default, but in the DeBERTa-V3 SPM file it is set to true.

I guess there might be all kinds of other settings that could be different as well...
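
For reference, byte fallback replaces a piece that isn't in the vocabulary with its UTF-8 bytes, each emitted as a <0xNN> piece. A small standalone sketch (illustrative only, not the PR code) of why the soft hyphen becomes two pieces:

import java.nio.charset.StandardCharsets;

public class ByteFallbackDemo {
    public static void main(String[] args) {
        String softHyphen = "\u00AD";
        for (byte b : softHyphen.getBytes(StandardCharsets.UTF_8)) {
            // SentencePiece vocabularies store byte pieces as "<0xNN>"
            System.out.println(String.format("<0x%02X>", b & 0xFF));
        }
        // prints <0xC2> then <0xAD>, which map to ids 198 and 177 in the DeBERTa-V3 vocab
    }
}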

@maxhniebergall requested review from a team as code owners September 17, 2024 20:26
@maxhniebergall changed the base branch from main to detached September 17, 2024 20:26
@maxhniebergall requested a review from a team as a code owner September 17, 2024 20:26
@maxhniebergall changed the base branch from detached to main September 17, 2024 20:26
@maxhniebergall changed the base branch from main to octaavio-patch-1 September 17, 2024 20:34
@maxhniebergall changed the base branch from octaavio-patch-1 to main September 17, 2024 20:34
@maxhniebergall requested review from davidkyle and removed request for a team September 17, 2024 20:40
@maxhniebergall (Contributor, Author) commented:

I've tested the tokenizer on the first 6000 query-passage pairs in MS-MARCO, and the differences in score between the direct PyTorch and Elasticsearch versions are around 10^-6 at most (a few in the low 10^-5 range); the total variance was 6.320295683227377e-12, which is extremely low. I spot-checked some of the cases and there wasn't any difference in the tokenization, so I believe this is due to floating point rounding error, as the orders of magnitude align with that. I'm going to run the test on the rest of MS-MARCO overnight to see if there are any other outliers.

@maxhniebergall (Contributor, Author) commented:

@elasticmachine merge upstream

@maxhniebergall (Contributor, Author) commented:

Manual testing revealed that any differences in tokenization are due to floating point rounding in scores and are not fixable.

}

@Override
int defaultSpanForChunking(int maxWindowSize) {
Review comment (Member):

nit: All the implementations of this method are the same; it should be implemented in the base class NlpTokenizer.

It's not related to your change, but it would be nice to clean up.
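
A rough sketch of that cleanup, i.e. hoisting the shared override into the base class; the helper name and method body below are placeholders for illustration, not the real implementation:

abstract class NlpTokenizer {

    // placeholder helper, named here only so the sketch is self-contained
    abstract int numExtraTokensForSingleSequence();

    // one shared default instead of an identical override in every subclass
    int defaultSpanForChunking(int maxWindowSize) {
        // placeholder body; the real computation is whatever the current overrides share
        return (maxWindowSize - numExtraTokensForSingleSequence()) / 2;
    }
}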

if (node.id == unknownTokenId && fuseUnk) {
if (node.id == unknownTokenId && byteFallback) {
CharSequence multiByteSequence = inputSequence.subSequence(node.startsAtCharPos, endsAtChars);
byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8);
Review comment (Member):

This line also appears in decomposeBytePieces. Could decomposeBytePieces take a byte[] argument so that this conversion is only done once?
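
A sketch of what that refactor could look like, with the UTF-8 conversion done once at the call site. The names pieces and vocabToId are assumptions made for the example, not the exact fields in the PR:

// Illustrative sketch only; assumes java.util.List, java.util.ArrayList and
// StandardCharsets are imported in the enclosing class.
if (node.id == unknownTokenId && byteFallback) {
    CharSequence multiByteSequence = inputSequence.subSequence(node.startsAtCharPos, endsAtChars);
    byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8);
    pieces.addAll(decomposeBytePieces(bytes));  // convert to bytes exactly once, here
}

// Each byte becomes a "<0xNN>" piece that is looked up in the vocabulary.
List<Integer> decomposeBytePieces(byte[] bytes) {
    List<Integer> pieceIds = new ArrayList<>(bytes.length);
    for (byte b : bytes) {
        String bytePiece = String.format("<0x%02X>", b & 0xFF);
        pieceIds.add(vocabToId.getOrDefault(bytePiece, unknownTokenId));
    }
    return pieceIds;
}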

@davidkyle (Member) left a comment:

LGTM

Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but it was too large to include in CI.

Thanks, this was really useful for verifying the tokenisation.

@maxhniebergall added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) and v8.16.0 labels, then removed the auto-merge-without-approval label Oct 3, 2024
@maxhniebergall merged commit ff53cb7 into main Oct 3, 2024
17 checks passed
@maxhniebergall deleted the addDebertaTokenizer branch October 3, 2024 15:00
@elasticsearchmachine (Collaborator) commented:

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 111852

matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
* start to add deberta-v2 tokenizer classes

* continue to add basic tokenizer stuff

* Finish adding DeBERTa-2 tokenizer

Still need to review & test

* Complete test setup and linting

* Update docs/changelog/111852.yaml

* Add serialization of deberta tokenization

* fix request builder to match model

* debugging

* add balanced truncation

* remove full vocabulary and use tiny vocab for tests

* Remove TODO

* precommit

* Add named writables and known tokenizers

* Add deberta to list of known tokenizers in test

* Add tests for balanced tokenizer and fix errors in tokenizer logic

* fix order of parameters passed to deberta

* Add support for byte_fallback which is enabled for DeBERTa

byte_fallback decomposes unknown tokens into multiple tokens each of one byte if those bytes are in the vocabulary.

* precommit

* update tests to account for byte decomposition

* remove sysout

* fix tests for byteFallback, for real this time

* Move defaultSpanForChunking into super class to avoid repetition

* simplify decomposeBytePieces

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>