Tokenize named character references using a DAFSA #645

Open: wants to merge 12 commits into base: main


@simonwuelker (Contributor) commented Jul 23, 2025

Currently, named character references are implemented using a phf map that is queried repeatedly, once for each character. This works, but it performs suboptimally and has a significant impact on binary size.
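
For readers unfamiliar with the map-based approach, here is a rough sketch of what a lookup per consumed character can look like. It is not the html5ever code: `std::collections::HashMap` stands in for the phf map, and the two-entry table, `build_table` and `match_entity_by_lookup` are hypothetical names made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in table: every prefix of every name is a key.
/// Prefixes that are not complete names map to `None`.
fn build_table() -> HashMap<&'static str, Option<char>> {
    let mut table = HashMap::new();
    for (name, ch) in [("lt;", '<'), ("le;", '\u{2264}')] {
        for end in 1..name.len() {
            table.entry(&name[..end]).or_insert(None);
        }
        table.insert(name, Some(ch));
    }
    table
}

/// Finds the longest name at the start of `input` (the text after `&`).
/// Returns the decoded character and the number of bytes consumed.
fn match_entity_by_lookup(
    table: &HashMap<&'static str, Option<char>>,
    input: &str,
) -> Option<(char, usize)> {
    let mut best = None;
    for end in 1..=input.len() {
        // One map query per consumed character; the whole prefix is hashed again each time.
        let entry = input.get(..end).and_then(|prefix| table.get(prefix));
        match entry {
            Some(Some(ch)) => best = Some((*ch, end)),
            Some(None) => {} // valid prefix, keep going
            None => break,   // no name starts with this prefix
        }
    }
    best
}
```

Note that the table has to hold every prefix as a key, and each step rehashes a string that grows by one character. Both costs are consistent with the size and speed concerns raised above.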

Traversing a DAFSA that is generated at compile time makes tokenizing named character references 30% faster. This technique is described in https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/. For illustration, a reduced version of the DAFSA can be viewed here.
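
To give a feel for the idea, here is a minimal hand-written sketch of a DAFSA for just four names (`ge;`, `gt;`, `le;`, `lt;`). It is not the generated automaton from this PR: the `Edge`/`State` layout, the `skip` field and `match_entity` are assumptions made for illustration, and legacy names without a trailing semicolon are ignored. Because `g` and `l` lead to the same state, and `e` and `t` do too, the accepting state is shared. The index into the result array is therefore recovered by summing, along the path, how many words sort before the edge taken.

```rust
/// One outgoing edge of a state.
struct Edge {
    byte: u8,
    target: usize,
    /// Number of words through the source state that sort before
    /// every word using this edge.
    skip: usize,
}

struct State {
    edges: &'static [Edge],
    accepting: bool,
}

// Four states are enough for four names because suffixes are shared.
static STATES: [State; 4] = [
    // 0: root
    State {
        edges: &[
            Edge { byte: b'g', target: 1, skip: 0 },
            Edge { byte: b'l', target: 1, skip: 2 },
        ],
        accepting: false,
    },
    // 1: after 'g' or 'l'
    State {
        edges: &[
            Edge { byte: b'e', target: 2, skip: 0 },
            Edge { byte: b't', target: 2, skip: 1 },
        ],
        accepting: false,
    },
    // 2: after 'e' or 't'
    State { edges: &[Edge { byte: b';', target: 3, skip: 0 }], accepting: false },
    // 3: after ';'
    State { edges: &[], accepting: true },
];

// Results in sorted name order: ge; gt; le; lt;
static RESULTS: [char; 4] = ['\u{2265}', '>', '\u{2264}', '<'];

/// Matches the longest name at the start of `input` (the text after `&`).
/// Returns the decoded character and the number of bytes consumed.
fn match_entity(input: &str) -> Option<(char, usize)> {
    let mut state = 0;
    let mut index = 0;
    let mut best = None;
    for (pos, &byte) in input.as_bytes().iter().enumerate() {
        let edge = match STATES[state].edges.iter().find(|e| e.byte == byte) {
            Some(edge) => edge,
            None => break,
        };
        index += edge.skip;
        state = edge.target;
        if STATES[state].accepting {
            best = Some((RESULTS[index], pos + 1));
        }
    }
    best
}
```

Each input byte costs one edge search and no hashing, and no per-prefix keys are stored. A real table would also need to handle names that are prefixes of other names, which this four-word example does not exercise.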

Apologies for the big change. If it's too hard to review we could also merge the DAFSA incrementally. Most of the diff is the list of named entities being moved around and a benchmark file being added.

I have not looked into how this affects the binary size. Some more savings are possible by packing the array of result characters.
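
As a sketch of what such packing could look like (this is one possible layout, not something implemented in this PR): a Unicode scalar value fits in 21 bits, so the first code point can occupy the low 21 bits of a `u32`, and the upper bits can hold an index into a small table of second code points. The `SECOND` table below is an illustrative subset, and `pack`/`unpack` are hypothetical names.

```rust
/// Illustrative subset of second code points.
/// Index 0 is reserved for "no second code point".
const SECOND: [u32; 3] = [0, 0x0338, 0x20D2];

/// Unicode scalar values go up to U+10FFFF, which fits in 21 bits.
const FIRST_BITS: u32 = 21;
const FIRST_MASK: u32 = (1 << FIRST_BITS) - 1;

/// Packs one or two characters into a single `u32`.
/// Returns `None` if the second character is not in `SECOND`.
fn pack(first: char, second: Option<char>) -> Option<u32> {
    let second_index = match second {
        None => 0,
        Some(c) => SECOND.iter().skip(1).position(|&s| s == c as u32)? + 1,
    };
    Some(((second_index as u32) << FIRST_BITS) | (first as u32))
}

/// Inverse of `pack`; expects a value produced by `pack`.
fn unpack(packed: u32) -> (char, Option<char>) {
    let first = char::from_u32(packed & FIRST_MASK)
        .expect("low 21 bits hold a valid scalar value");
    let second = match (packed >> FIRST_BITS) as usize {
        0 => None,
        i => char::from_u32(SECOND[i]),
    };
    (first, second)
}
```

With this layout each result takes four bytes instead of two separate code point slots. Whether that is worth the extra decoding step would need measuring.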

Do not merge yet - needs servo companion PR.

This is a breaking change for markup5ever and web_atoms.

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
@nicoburns (Contributor)

Would it be possible to separate out the DAFSA so that it doesn't depend on markup5ever traits? A small/fast HTML entity parser (possibly following a "sans-io" design?) seems like a really useful standalone crate.

btw, is a "DAFSA" the same thing as what burntsushi calls a "DFA" in his blogposts on the regex crate?

@simonwuelker (Contributor, Author)

> Would it be possible to separate out the DAFSA so that it doesn't depend on markup5ever traits? A small/fast HTML entity parser (possibly following a "sans-io" design?) seems like a really useful standalone crate.

Yeah sure, I wasn't (and am still not) sure if adding another crate is worth it. Though it would likely only require very little maintenance, because the list of named entities will never change...

@simonwuelker (Contributor, Author)

> btw, is a "DAFSA" the same thing as what burntsushi calls a "DFA" in his blogposts on the regex crate?

Pretty much, yes. A DFA is a deterministic finite state automaton, and DAFSAs are the subset of DFAs that are acyclic (DAFSA = deterministic acyclic finite state automaton).

@simonwuelker simonwuelker marked this pull request as ready for review July 30, 2025 18:00
Labels
V-breaking Breaking change