[ML] Add DeBERTa-V2/V3 tokenizer #111852

Merged 25 commits into main from addDebertaTokenizer on Oct 3, 2024

Conversation

@maxhniebergall (Contributor) commented Aug 13, 2024

This PR adds support for the DeBERTa-V2/V3 tokenizer (DeBERTa-V2 and V3 use the same tokenizer). It also adds the balanced truncation method.

This PR contains:

  • the new tokenizer classes and the new named writeables
  • the special tokens in DeBERTa differ slightly, but otherwise the code is almost identical to the XLM-RoBERTa tokenizer
  • the balanced truncation option, which is now available for all models that take two inputs (see the sketch below)
  • byte fallback (decomposeBytePieces) for unknown tokens
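
For readers unfamiliar with the option, balanced truncation splits the remaining token budget across both inputs instead of truncating only one of them. Below is a minimal, hypothetical sketch of the idea; the class and method names are made up for illustration and this is not the code added in this PR.

public class BalancedTruncationSketch {

    // Returns how many tokens to keep from each of the two inputs.
    static int[] balancedLengths(int len1, int len2, int maxSequenceLength, int numSpecialTokens) {
        int budget = maxSequenceLength - numSpecialTokens;  // room left once special tokens are counted
        if (len1 + len2 <= budget) {
            return new int[] { len1, len2 };                // both inputs fit, nothing to truncate
        }
        int half = budget / 2;
        if (len1 <= half) {
            return new int[] { len1, budget - len1 };       // input 1 is short, give its spare room to input 2
        }
        if (len2 <= half) {
            return new int[] { budget - len2, len2 };       // input 2 is short, give its spare room to input 1
        }
        return new int[] { budget - half, half };           // both are long, split the budget roughly in half
    }

    public static void main(String[] args) {
        int[] kept = balancedLengths(40, 300, 128, 3);      // e.g. a 40-token query and a 300-token passage
        System.out.println(kept[0] + " + " + kept[1]);      // prints "40 + 85"
    }
}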

Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but it was too large to include in CI.

@elasticsearchmachine (Collaborator) commented:

Hi @maxhniebergall, I've created a changelog YAML for you.

tokenIds.add(IntStream.of(sepTokenId));
tokenMap.add(IntStream.of(SPECIAL_TOKEN_POSITION));
}
seqPairOffset = withSpecialTokens ? tokenId1s.size() + 2 : tokenId1s.size();
@maxhniebergall (Contributor, Author) commented:

The final special token isn't included in the offset.

@maxhniebergall marked this pull request as ready for review September 5, 2024 18:27
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label Sep 5, 2024
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

@maxhniebergall (Contributor, Author) commented:

@elasticmachine merge upstream

@maxhniebergall (Contributor, Author) commented Sep 13, 2024

There's an issue with the tokenizer involving the special character \xad (a hidden "soft hyphen").

In our system, that character is tokenized as [1, 336, 2, 507, 3, 2], but Python gives [1, 336, 2, 507, 198, 177, 2]. Token 507 is the whitespace character, so I guess SentencePiece is interpreting the soft hyphen as a whitespace character, followed by an unknown token in ES and by \xad in Python. In the vocabulary, 198 is <0xC2> and 177 is <0xAD>, which correspond to the UTF-8 encoding of the soft hyphen. So in Python the character is correctly broken down into these two bytes, but in our system it is interpreted as one unknown character...

I believe I figured out why this is happening:

BYTE_FALLBACK is false by default, but in the DeBERTa-V3 SPM file it is set to true.

I guess there might be all kinds of other settings that could be different as well...
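
For reference, byte fallback replaces a piece that isn't in the vocabulary with its UTF-8 bytes, each emitted as a <0xNN> piece. A small standalone sketch (illustrative only, not the PR code) of why the soft hyphen becomes two pieces:

import java.nio.charset.StandardCharsets;

public class ByteFallbackDemo {
    public static void main(String[] args) {
        String softHyphen = "\u00AD";
        for (byte b : softHyphen.getBytes(StandardCharsets.UTF_8)) {
            // SentencePiece vocabularies store byte pieces as "<0xNN>"
            System.out.println(String.format("<0x%02X>", b & 0xFF));
        }
        // prints <0xC2> then <0xAD>, which map to ids 198 and 177 in the DeBERTa-V3 vocab
    }
}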

@maxhniebergall requested review from a team as code owners September 17, 2024 20:26
@maxhniebergall changed the base branch from main to detached September 17, 2024 20:26
@maxhniebergall requested a review from a team as a code owner September 17, 2024 20:26
@maxhniebergall changed the base branch from detached to main September 17, 2024 20:26
@maxhniebergall changed the base branch from main to octaavio-patch-1 September 17, 2024 20:34
@maxhniebergall changed the base branch from octaavio-patch-1 to main September 17, 2024 20:34
@maxhniebergall requested review from davidkyle and removed request for a team September 17, 2024 20:40
@maxhniebergall (Contributor, Author) commented:

I've tested the tokenizer on the first 6000 query-passage pairs in MS-MARCO, and the differences in score between the direct PyTorch and Elasticsearch versions are around 10^-6 at most (a few in the low 10^-5 range); the total variance was 6.320295683227377e-12, which is extremely low. I spot-checked some of the cases and there wasn't any difference in the tokenization, so I believe this is due to floating point rounding error, as the orders of magnitude align with that. I'm going to run the test on the rest of MS-MARCO overnight to see if there are any other outliers.

@maxhniebergall (Contributor, Author) commented:

@elasticmachine merge upstream

@maxhniebergall (Contributor, Author) commented:

Manual testing revealed that any differences in tokenization are due to floating point rounding in scores and are not fixable.

}

@Override
int defaultSpanForChunking(int maxWindowSize) {
Review comment (Member):

nit: All the implementations of this method are the same; it should be implemented in the base class NlpTokenizer.

It's not related to your change, but it would be nice to clean up.
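
A rough sketch of that cleanup, i.e. hoisting the shared override into the base class; the helper name and method body below are placeholders for illustration, not the real implementation:

abstract class NlpTokenizer {

    // placeholder helper, named here only so the sketch is self-contained
    abstract int numExtraTokensForSingleSequence();

    // one shared default instead of an identical override in every subclass
    int defaultSpanForChunking(int maxWindowSize) {
        // placeholder body; the real computation is whatever the current overrides share
        return (maxWindowSize - numExtraTokensForSingleSequence()) / 2;
    }
}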

if (node.id == unknownTokenId && fuseUnk) {
if (node.id == unknownTokenId && byteFallback) {
CharSequence multiByteSequence = inputSequence.subSequence(node.startsAtCharPos, endsAtChars);
byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8);
Review comment (Member):

This line also appears in decomposeBytePieces. Could decomposeBytePieces take a byte[] argument so that this conversion is only done once?
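
A sketch of what that refactor could look like, with the UTF-8 conversion done once at the call site. The names pieces and vocabToId are assumptions made for the example, not the exact fields in the PR:

// Illustrative sketch only; assumes java.util.List, java.util.ArrayList and
// StandardCharsets are imported in the enclosing class.
if (node.id == unknownTokenId && byteFallback) {
    CharSequence multiByteSequence = inputSequence.subSequence(node.startsAtCharPos, endsAtChars);
    byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8);
    pieces.addAll(decomposeBytePieces(bytes));  // convert to bytes exactly once, here
}

// Each byte becomes a "<0xNN>" piece that is looked up in the vocabulary.
List<Integer> decomposeBytePieces(byte[] bytes) {
    List<Integer> pieceIds = new ArrayList<>(bytes.length);
    for (byte b : bytes) {
        String bytePiece = String.format("<0x%02X>", b & 0xFF);
        pieceIds.add(vocabToId.getOrDefault(bytePiece, unknownTokenId));
    }
    return pieceIds;
}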

@davidkyle (Member) left a comment:

LGTM

Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but it was too large to include in CI.

Thanks, this was really useful for verifying the tokenisation.

@maxhniebergall added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) and v8.16.0 labels, then removed the auto-merge-without-approval label Oct 3, 2024
@maxhniebergall merged commit ff53cb7 into main Oct 3, 2024
17 checks passed
@maxhniebergall deleted the addDebertaTokenizer branch October 3, 2024 15:00
@elasticsearchmachine (Collaborator) commented:

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 111852

matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
* start to add deberta-v2 tokenizer classes

* continue to add basic tokenizer stuff

* Finish adding DeBERTa-2 tokenizer

Still need to review & test

* Complete test setup and linting

* Update docs/changelog/111852.yaml

* Add serialization of deberta tokenization

* fix request builder to match model

* debugging

* add balanced truncation

* remove full vocabulary and use tiny vocab for tests

* Remove TODO

* precommit

* Add named writables and known tokenizers

* Add deberta to list of known tokenizers in test

* Add tests for balanced tokenizer and fix errors in tokenizer logic

* fix order of parameters passed to deberta

* Add support for byte_fallback which is enabled for DeBERTa

byte_fallback decomposes unknown tokens into multiple tokens each of one byte if those bytes are in the vocabulary.

* precommit

* update tests to account for byte decomposition

* remove sysout

* fix tests for byteFallback, for real this time

* Move defaultSpanForChunking into super class to avoid repetition

* simplify decomposeBytePieces

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>