@aldehir (Collaborator) commented Dec 6, 2025

Implementation of an idea by @ngxson: #17750 (comment)

cc: @pwilkin @aviallon

Problem

The llama-grammar implementation has no way to accept tokens directly, which creates a few problems:

  • It can't disambiguate between a special token (e.g. <|end|>) and the tokenized form <|, end, |> that may occur in content.
  • It requires awkward "exclusion" rules such as ( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )* to match chunks of characters that don't accumulate to the desired delimiter (<|end|>); the full pattern is written out in the sketch after this list.
  • It adds extra work to grammar sampling by recursively applying character rules.
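
For concreteness, here is what that exclusion rule looks like written out in full. This is my expansion of the elided pattern above, not text from the PR:

not-end ::= ( [^<] | "<" [^|] | "<|" [^e] | "<|e" [^n] | "<|en" [^d] | "<|end" [^|] | "<|end|" [^>] )*

Even this expanded form is subtly fragile: an input like <<|end|> can still parse (the leading "<" [^|] consumes <<, and the remainder matches character by character), so the delimiter slips through via an alternate parse. Making the rule airtight requires even messier restart handling, which is exactly the awkwardness token matching avoids.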

Proposed Solution

Borrowing some ideas from llguidance, you can reference a token by id with <[id]>, or by its raw text with <token> when the token's text is already enclosed in < and >. I'm leaving out support for token id ranges/alternates since I don't see an immediate need for them.

You can negate by prefixing the token with !, e.g. !<|end|>.

Example (gpt-oss)

By token id:

root ::= analysis response
analysis ::= <[200005]> "analysis" <[200008]> (!<[200007]>)* <[200007]>
response ::= <[200006]> "assistant" <[200005]> "final" <[200008]> .*

That's not very readable, but it is useful for tokens whose text is not wrapped in < and >. If it is, you can use the token text directly:

root ::= analysis response
analysis ::= <|channel|> "analysis" <|message|> (!<|end|>)* <|end|>
response ::= <|start|> "assistant" <|channel|> "final" <|message|> .*

Use Case: Reasoning Budget Enforcement

Assuming the model's vocabulary has unique tokens for its thinking tags, enforcing a reasoning budget (here, at most 200 tokens) is fairly trivial via grammar:

root ::= analysis response
analysis ::= <|channel|> "analysis" <|message|> reasoning-with-budget
reasoning-with-budget ::= (!<|end|>){0,200} <|end|>
response ::= <|start|> "assistant" <|channel|> "final" <|message|> .*

# optionally, inject pieces to guide the model when it goes over
reasoning-with-budget ::= (!<|end|>){0,200} (<|end|> | "--I need to provide an immediate response" <|end|>)
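
The bounded repetition also supports a floor. As a sketch of my own (not from the PR), assuming a negated token element consumes exactly one token as the example above implies, you can require a minimum amount of reasoning as well:

reasoning-with-budget ::= (!<|end|>){16,200} <|end|>

Because each !<|end|> matches a single token, the {m,n} bounds count tokens rather than characters, which is what makes this usable as a budget.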

Notes:

  • It is important that the grammar is unambiguous, otherwise the model may find a way to continue thinking via other paths in the grammar (see the sketch after this list).
  • gpt-oss may be a poor example since it already has reasoning_effort, but the budget approach works pretty well.
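
To illustrate the first note, here is a sketch (mine, not from the PR) of an ambiguous grammar that defeats the budget, because an unbounded alternative remains reachable:

analysis ::= <|channel|> "analysis" <|message|> (reasoning-with-budget | reasoning-unbounded)
reasoning-with-budget ::= (!<|end|>){0,200} <|end|>
reasoning-unbounded ::= (!<|end|>)* <|end|>

Grammar-constrained sampling permits any token that keeps at least one parse alive, so the reasoning-unbounded branch keeps every non-<|end|> token legal beyond 200 tokens and the budget is never enforced.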

To Do

  • Implement token support in llama-grammar
  • Refactor trigger_patterns to collect tokens and replay them after a successful trigger. Support partial token matches by feeding only the matched piece to the grammar.
  • Update grammar documentation and provide an example in grammars/

AI Disclosure: LLM was used to help understand the grammar code, assist in writing documentation and test cases, and review implementations. All output generated by an LLM has been reviewed.

@aldehir aldehir changed the title llama : add token support to llama-grammar llama : add token matching support to llama-grammar Dec 6, 2025
@aviallon (Contributor) commented Dec 6, 2025

Very interesting. I'll see what I can build upon that.

@github-actions github-actions bot added the testing Everything test related label Dec 6, 2025
@aldehir aldehir marked this pull request as ready for review December 7, 2025 00:19
@aldehir aldehir requested a review from ggerganov as a code owner December 7, 2025 00:19