[ML] Add DeBERTa-V2/V3 tokenizer #111852
Conversation
Hi @maxhniebergall, I've created a changelog YAML for you.
    tokenIds.add(IntStream.of(sepTokenId));
    tokenMap.add(IntStream.of(SPECIAL_TOKEN_POSITION));
}
seqPairOffset = withSpecialTokens ? tokenId1s.size() + 2 : tokenId1s.size();
the final special token isn't included in the offset
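A worked layout may make the offset arithmetic clearer. This is only a sketch with made-up lengths, assuming the usual [CLS] … [SEP] … [SEP] pair layout; `firstSequenceLength` stands in for `tokenId1s.size()`.

```java
public class SeqPairOffsetExample {
    public static void main(String[] args) {
        boolean withSpecialTokens = true; // mirrors the flag in the snippet above
        int firstSequenceLength = 4;      // hypothetical, stands in for tokenId1s.size()
        // Layout: [CLS] a_0 a_1 a_2 a_3 [SEP] b_0 b_1 ... [SEP]
        // Index:    0    1   2   3   4    5    6   7
        int seqPairOffset = withSpecialTokens ? firstSequenceLength + 2 : firstSequenceLength;
        System.out.println(seqPairOffset); // 6 -> index of the first token of the second sequence
        // The trailing [SEP] follows the second sequence, so it never enters this offset.
    }
}
```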
Pinging @elastic/ml-core (Team:ML)
@elasticmachine merge upstream
There's an issue with the tokenizer to do with a particular special character: in our system that character is tokenized differently. I believe I figured out why this is happening: one of the default settings differs. I guess there might be all kinds of other settings that could be different as well...
Force-pushed from 3ea6abd to d2e347e.
Still need to review & test
byte_fallback decomposes unknown tokens into multiple tokens each of one byte if those bytes are in the vocabulary.
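As a rough illustration of that decomposition: a hypothetical `byteFallback` helper (not the code added in this PR), using the SentencePiece-style `<0xNN>` byte-token naming.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteFallbackSketch {
    // Hypothetical helper: decompose an unknown piece into one-byte tokens,
    // each of which would be looked up in the vocabulary as "<0xNN>".
    static List<String> byteFallback(String unknownPiece) {
        List<String> pieces = new ArrayList<>();
        for (byte b : unknownPiece.getBytes(StandardCharsets.UTF_8)) {
            pieces.add(String.format("<0x%02X>", b & 0xFF)); // one token per UTF-8 byte
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "♥" is unlikely to be in the vocab, so it falls back to its three UTF-8 bytes.
        System.out.println(byteFallback("♥")); // [<0xE2>, <0x99>, <0xA5>]
    }
}
```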
Force-pushed from e0567df to 82fd7b4.
I've tested the tokenizer on the first 6000 query-passage pairs in MS-MARCO. The differences in score between the direct-PyTorch and Elasticsearch versions are around 10^-6 at most (a few in the low 10^-5 range), and the total variance was
@elasticmachine merge upstream
Manual testing revealed that any remaining differences in tokenization only surface as floating-point rounding in the scores and are not fixable.
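A tolerance-based comparison along these lines is the usual way to accept rounding-level differences; the scores and the threshold below are illustrative, not the actual MS-MARCO test harness.

```java
public class ScoreComparisonSketch {
    public static void main(String[] args) {
        double pytorchScore = 0.872341;       // hypothetical score from the direct-PyTorch run
        double elasticsearchScore = 0.872339; // hypothetical score from the Elasticsearch run
        double tolerance = 1e-4;              // comfortably above the observed ~1e-6 to low-1e-5 gaps
        if (Math.abs(pytorchScore - elasticsearchScore) > tolerance) {
            throw new AssertionError("scores diverge beyond floating-point rounding");
        }
        System.out.println("scores agree within tolerance");
    }
}
```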
}

@Override
int defaultSpanForChunking(int maxWindowSize) {
nit: All the implementations of this method are the same; it should be implemented in the base class NlpTokenizer.
It's not related to your change, but it would be nice to clean up.
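A minimal sketch of that clean-up, assuming the subclass implementations really are identical; the class names mirror the PR but the sketch classes and the method body are placeholders.

```java
// Illustrative shape of the suggested clean-up; the real NlpTokenizer has many more members.
abstract class NlpTokenizerSketch {
    // Implemented once here instead of being repeated in every concrete tokenizer.
    int defaultSpanForChunking(int maxWindowSize) {
        // Placeholder rule: overlap adjacent chunks by half the window.
        return maxWindowSize / 2;
    }
}

class DebertaV2TokenizerSketch extends NlpTokenizerSketch {
    // No override needed: the base-class default applies.
}
```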
if (node.id == unknownTokenId && fuseUnk) {
if (node.id == unknownTokenId && byteFallback) {
    CharSequence multiByteSequence = inputSequence.subSequence(node.startsAtCharPos, endsAtChars);
    byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8);
This line is also in decomposeBytePieces. Can decomposeBytePieces take a byte[] argument so this is only done once?
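One way to act on that suggestion: encode to UTF-8 once at the call site and hand the bytes to decomposeBytePieces. The signatures below (handleUnknownToken, the void returns) are hypothetical and not the merged code.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the suggested refactor: the caller calls getBytes once and passes the
// result in, instead of both methods re-encoding the same subsequence.
class BytePiecesRefactorSketch {
    void handleUnknownToken(CharSequence multiByteSequence) {
        byte[] bytes = multiByteSequence.toString().getBytes(StandardCharsets.UTF_8); // done once here
        decomposeBytePieces(bytes);                                                   // reused below
    }

    void decomposeBytePieces(byte[] bytes) {
        for (byte b : bytes) {
            // look up the single-byte piece (e.g. "<0xE2>") for b in the vocabulary ...
        }
    }
}
```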
LGTM
Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but the vocab file was too large to include in CI.
Thanks, this was really useful for verifying the tokenisation.
💔 Backport failed. The backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running it locally.
* start to add deberta-v2 tokenizer classes
* continue to add basic tokenizer stuff
* Finish adding DeBERTa-2 tokenizer (still need to review & test)
* Complete test setup and linting
* Update docs/changelog/111852.yaml
* Add serialization of deberta tokenization
* fix request builder to match model
* debugging
* add balanced truncation
* remove full vocabulary and use tiny vocab for tests
* Remove TODO
* precommit
* Add named writables and known tokenizers
* Add deberta to list of known tokenizers in test
* Add tests for balanced tokenizer and fix errors in tokenizer logic
* fix order of parameters passed to deberta
* Add support for byte_fallback, which is enabled for DeBERTa (byte_fallback decomposes unknown tokens into multiple tokens, each of one byte, if those bytes are in the vocabulary)
* precommit
* update tests to account for byte decomposition
* remove sysout
* fix tests for byteFallback, for real this time
* Move defaultSpanForChunking into super class to avoid repetition
* simplify decomposeBytePieces

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This PR adds support for the DeBERTa-V2/V3 tokenizer (DeBERTa-V2 and V3 use the same tokenizer). It also adds the balanced truncation method.
This PR contains:
Commit 54f9995#diff-2f9a331e12c3fb7602085cb04501a93f424bafa029864c2f798b83860e9679b7 includes end-to-end testing of tokenization with a real vocab file from Huggingface, but the vocab file was too large to include in CI.
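For context on the balanced truncation mentioned above: a minimal sketch of the idea, assuming "balanced" means trimming the longer of the two sequences first so the pair fits in the window. The balancedLengths helper is hypothetical and not necessarily the shipped behaviour.

```java
public class BalancedTruncationSketch {
    // Trim the currently longer sequence one token at a time until the pair fits maxLength.
    static int[] balancedLengths(int len1, int len2, int maxLength) {
        while (len1 + len2 > maxLength) {
            if (len1 >= len2) {
                len1--;
            } else {
                len2--;
            }
        }
        return new int[] { len1, len2 };
    }

    public static void main(String[] args) {
        int[] lengths = balancedLengths(120, 30, 100);
        System.out.println(lengths[0] + ", " + lengths[1]); // 70, 30 -> only the longer side is cut
    }
}
```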