Tokenize named character references using a DAFSA #645
Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
Would it be possible to separate out the DAFSA so that it doesn't depend on markup5ever traits? A small/fast HTML entity parser (possibly following a "sans-io" design?) seems like a really useful standalone crate. btw, is a "DAFSA" the same thing as what burntsushi calls a "DFA" in his blogposts on the |
Yeah sure, I wasn't (and am still not) sure if adding another crate is worth it. Though it would likely only require very little maintenance, because the list of named entities will never change...
Pretty much, yes - a DFA is a deterministic finite state automaton. DAFSAs are the subset of DFAs that are acyclic (DAFSA = deterministic acyclic finite state automaton).
Currently named character references are implemented using a `phf` map that is repeatedly queried for each character. This works, but has suboptimal performance and a significant impact on binary size.

Traversing a DAFSA that is generated at compile time makes tokenizing named character references 30% faster. This technique is described in https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/. For illustration, a reduced version of the DAFSA can be viewed here.
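To make the traversal concrete, here is a minimal hand-built sketch covering only six names (`amp`, `amp;`, `gt`, `gt;`, `lt`, `lt;`). It is not the table generated in this PR. The `Node` layout, the `NODES` and `VALUES` tables and the `longest_match` function are hypothetical names chosen for illustration, loosely following the general idea in the linked post as I understand it: shared nodes cannot hold a value themselves, so each node carries a count of the names reachable through it, and the traversal adds up the counts of skipped siblings to get a unique index into a separate value array.

```rust
/// One DAFSA node. Siblings are stored contiguously and end at `end_of_list`.
#[derive(Clone, Copy)]
struct Node {
    ch: u8,
    /// How many names can be completed from this node (itself included).
    number: u8,
    end_of_word: bool,
    end_of_list: bool,
    /// Index of the first child in `NODES`, or 0 if there are no children.
    child: u8,
}

const fn node(ch: u8, number: u8, end_of_word: bool, end_of_list: bool, child: u8) -> Node {
    Node { ch, number, end_of_word, end_of_list, child }
}

// Hand-written for illustration; a real table would be generated at compile time.
const NODES: [Node; 8] = [
    node(0, 0, false, true, 1),     // 0: root
    node(b'a', 2, false, false, 4), // 1
    node(b'g', 2, false, false, 6), // 2
    node(b'l', 2, false, true, 6),  // 3: shares the `t` node with `g`
    node(b'm', 2, false, true, 5),  // 4
    node(b'p', 2, true, true, 7),   // 5
    node(b't', 2, true, true, 7),   // 6
    node(b';', 1, true, true, 0),   // 7: shared by `amp;`, `gt;` and `lt;`
];

// Values in sorted name order: amp, amp;, gt, gt;, lt, lt;
const VALUES: [char; 6] = ['&', '&', '>', '>', '<', '<'];

/// Returns the byte length of the longest name that prefixes `input`
/// (the text after the `&`) together with the character it stands for.
fn longest_match(input: &str) -> Option<(usize, char)> {
    let mut list = NODES[0].child as usize;
    let mut index = 0usize; // 1-based index of the last matched name
    let mut best = None;
    for (pos, &byte) in input.as_bytes().iter().enumerate() {
        if list == 0 {
            break; // the previous node had no children
        }
        let mut i = list;
        let found = loop {
            let n = NODES[i];
            if n.ch == byte {
                break Some(n);
            }
            if n.end_of_list {
                break None;
            }
            // Every name reachable through a skipped sibling sorts before ours.
            index += n.number as usize;
            i += 1;
        };
        let Some(n) = found else { break };
        if n.end_of_word {
            index += 1;
            best = Some((pos + 1, VALUES[index - 1]));
        }
        list = n.child as usize;
    }
    best
}
```

Each input character costs one scan over a short sibling list, and the tokenizer only has to remember the last match, so no map lookup on a growing prefix is needed. For example, `longest_match("ampere")` stops after `amp` and reports three consumed bytes.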
Apologies for the big change. If it's too hard to review we could also merge the DAFSA incrementally. Most of the diff is the list of named entities being moved around and a benchmark file being added.
I have not looked into how this affects the binary size. Some more savings are possible by packing the array of result characters.
Do not merge yet - needs servo companion PR.
This is a breaking change for `markup5ever` and `web_atoms`.