llama : add token matching support to llama-grammar #17816
Implementation of idea by @ngxson: #17750 (comment)
cc: @pwilkin @aviallon
Problem
The `llama-grammar` implementation doesn't have a way to accept tokens directly, which creates a few problems:

- There is no way to distinguish between a special token (e.g. `<|end|>`) and the tokenized form `<|`, `end`, `|>` that may occur in content.
- Grammars need convoluted expressions like `( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )*` to match chunks of characters that don't accumulate to the desired delimiter (`<|end|>`); see the expanded sketch below.
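Spelled out in full, that style of workaround for `<|end|>` looks something like the following in today's GBNF (a sketch only, with the elided alternatives expanded):

```
# Existing-style workaround: accept characters only while they never complete
# the literal delimiter "<|end|>".
content ::= ( [^<] | "<" [^|] | "<|" [^e] | "<|e" [^n] | "<|en" [^d] | "<|end" [^|] | "<|end|" [^>] )*
```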
Proposed Solution

Borrowing some ideas from llguidance, you can define a token by id, `<[id]>`, or as raw token text, `<token>`, if it is encased in `<` / `>`. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for it.

You can negate a token by prefixing it with `!`, e.g. `!<|end|>`.

Example (gpt-oss)
By token id:
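A minimal sketch under the proposed syntax, using 200007 as a stand-in id for gpt-oss's `<|end|>` token (illustrative, not verified against the vocab):

```
# Sketch only: accept any token that is not <|end|> (assumed id 200007),
# then require the <|end|> token itself.
root ::= ( !<[200007]> )* <[200007]>
```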
That's not very readable, but it is useful for tokens not wrapped in `<` / `>`. If they are, you can use them directly:
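A sketch of the same rule written against the raw token text, again assuming the proposed syntax:

```
# Sketch only: same rule as above, referencing the special token by its text.
root ::= ( !<|end|> )* <|end|>
```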
Use Case: Reasoning Budget Enforcement

Assuming the model's vocab has unique tokens for its thinking tags, enforcing a reasoning budget is fairly trivial via grammar:
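For instance, a minimal sketch under the proposed syntax; `<think>` / `</think>` stand in for whatever single-token thinking tags the model actually uses, and the 200-token cap is arbitrary:

```
# Sketch only: allow at most ~200 tokens inside the thinking span, then force
# the closing tag; anything may follow afterwards.
root ::= <think> ( !</think> ){0,200} </think> .*
```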
Notes:
- gpt-oss may be a poor example since it has `reasoning_effort`, but the budget approach works pretty well.

To Do

- Update `llama-grammar` `trigger_patterns` handling to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
- Update the documentation in `grammars/`.

AI Disclosure: An LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.