server: improve speed of speculative decoding #17808

ngxson · 2025-12-05T23:51:30Z

Fix #12968

I'm testing with:

- Draft model: https://huggingface.co/unsloth/Qwen3-0.6B-GGUF (using Q8_0)
- Main model: https://huggingface.co/unsloth/Qwen3-8B-GGUF (using Q4_K_M)

So far the results are coherent.
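For readers less familiar with the draft/main split used in this test setup, here is a minimal sketch of one greedy draft-and-verify round, using toy stand-in functions instead of real models. It is a generic illustration of speculative decoding and is not taken from the code in this PR; `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and a real server would score all drafted positions in a single batch rather than one call per token.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(ctx: List[Token], draft_next: NextFn,
                     target_next: NextFn, n_draft: int) -> List[Token]:
    """Run one draft-and-verify round (greedy) and return the accepted tokens."""
    # 1. The small draft model proposes n_draft tokens autoregressively.
    draft: List[Token] = []
    for _ in range(n_draft):
        draft.append(draft_next(ctx + draft))

    # 2. The main model checks each drafted position in order.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every drafted token matched, so the main model adds one more token.
    accepted.append(target_next(ctx + accepted))
    return accepted


# Toy stand-ins (assumptions for illustration only):
# the "main" model counts upward modulo 10.
def target_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


# The "draft" model agrees with the main model except right after token 3.
def draft_next(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], draft_next, target_next, n_draft=4))
    print(speculative_step([5], draft_next, target_next, n_draft=2))
```

In this sketch the accepted tokens always match what greedy decoding with the main model alone would have produced. The draft model only affects how many tokens are confirmed per round, which is why a coherence check on the output is a reasonable sanity test.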

How it works:

ngxson · 2025-12-06T14:53:33Z

Server tests passed locally; this should be ready for review, @ggerganov.

server: improve speed of speculative decoding

f2f08f8

loci-dev mentioned this pull request Dec 6, 2025

UPSTREAM PR #17808: server: improve speed of speculative decoding auroralabs-loci/llama.cpp#463

Open

github-actions bot added examples server labels Dec 6, 2025

fix small draft case

cac8d7b

ngxson marked this pull request as ready for review December 6, 2025 14:53

ngxson requested a review from ggerganov as a code owner December 6, 2025 14:53

add link to the PR

398ae8d
