@Susan9001 (Contributor) commented Nov 28, 2025

Details

This PR is a follow-up to #4229 and makes DashScope Qwen more robust as a GEval judge model when used via LiteLLMChatModel.

Currently, when a model advertises logprobs and top_logprobs support, GEval enables the logprobs-aware scoring path. For DashScope Qwen this can occasionally lead to MetricComputationError("Failed to calculate g-eval score") because the returned logprobs do not always match the OpenAI-style format expected by the parser.
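To make the failure mode concrete, here is a minimal sketch of logprobs-weighted scoring in the G-Eval style. It is not Opik's actual parser: the helper name `weighted_score_from_top_logprobs`, the 0 to 10 score range, and the input shape (a list of OpenAI-style `{"token", "logprob"}` entries for the score token position) are all assumptions for illustration. When an entry lacks the expected keys, or no candidate token is a valid integer score, the sketch raises, which is roughly the situation the error above reports.

```python
import math


def weighted_score_from_top_logprobs(top_logprobs, min_score=0, max_score=10):
    """Probability-weighted score over candidate score tokens (hypothetical helper)."""
    weighted_sum = 0.0
    total_prob = 0.0
    for entry in top_logprobs:
        try:
            token = str(entry["token"]).strip()
            logprob = float(entry["logprob"])
        except (KeyError, TypeError, ValueError) as exc:
            # The entry does not follow the assumed OpenAI-style shape.
            raise ValueError("unexpected logprobs entry format") from exc
        if not token.isdecimal():
            continue  # skip non-numeric candidates such as '"' or ' '
        score = int(token)
        if not min_score <= score <= max_score:
            continue
        prob = math.exp(logprob)
        weighted_sum += score * prob
        total_prob += prob
    if total_prob == 0.0:
        raise ValueError("no usable score tokens in logprobs")
    return weighted_sum / total_prob
```

A provider that returns entries under different keys, or omits the candidate list entirely, would hit one of the two `ValueError` branches even though the text of the reply contains a perfectly usable score.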

This PR treats DashScope Qwen as not logprobs-supported in this context, so GEval falls back to the standard text/JSON-based parsing path instead of relying on logprobs.
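The fallback described above could be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `LOGPROBS_DENYLIST_PREFIXES`, `use_logprobs_path`, and `parse_score_from_text` are hypothetical names, and the judge reply is assumed to be a JSON object with a `score` field.

```python
import json

# Hypothetical denylist: model-name prefixes whose logprobs are not
# treated as OpenAI-compatible, even if the provider advertises them.
LOGPROBS_DENYLIST_PREFIXES = ("dashscope/",)


def use_logprobs_path(model_name, advertised_params):
    """Return True only if logprobs are advertised and the model is not denylisted."""
    if model_name.lower().startswith(LOGPROBS_DENYLIST_PREFIXES):
        return False
    return "logprobs" in advertised_params and "top_logprobs" in advertised_params


def parse_score_from_text(content):
    """Text/JSON fallback: read the score from the judge's JSON reply (assumed shape)."""
    payload = json.loads(content)
    return float(payload["score"])
```

With this gate, `use_logprobs_path("dashscope/qwen-flash", ["logprobs", "top_logprobs"])` is false, so scoring would go through `parse_score_from_text`. The trade-off is that the denylisted model loses the finer-grained probability-weighted score in exchange for reliable parsing.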

Change checklist

  • User facing
  • Documentation update

Issues

Testing

Locally:

  • pytest tests/unit/evaluation/models/test_litellm_chat_model.py
  • Ran additional examples with dashscope/qwen-flash as the judge model, configured as follows:
          import os
          from opik.evaluation import models

          # Judge model name taken from the test run described above.
          judge_model_name = "dashscope/qwen-flash"
          judge_model = models.LiteLLMChatModel(
              model_name=judge_model_name,
              api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
              api_key=os.getenv("DASHSCOPE_API_KEY"),
          )
    All samples now score successfully, with no "Failed to calculate g-eval score" errors.

Documentation

@Susan9001 Susan9001 requested a review from a team as a code owner November 28, 2025 20:37
@yaricom (Contributor) commented Nov 30, 2025

Hi @Susan9001! Thank you for the contribution! Please fix the merge conflicts with the current branch.

Cheers,
Iaroslav

@Susan9001 (Contributor, Author)

Hi @yaricom,
I have just resolved the merge conflicts. Sorry it took me a while to get back to this. Please let me know if there is anything else I need to adjust.

@yaricom yaricom merged commit 59a883c into comet-ml:main Dec 12, 2025
35 of 38 checks passed
@yaricom (Contributor) commented Dec 12, 2025

Hi @Susan9001! Thank you for the contribution!

Happy coding!
Iaroslav

Development

Successfully merging this pull request may close these issues.

[Bug]: GEval LiteLLMChatModel with DashScope Qwen sometimes fails to calculate g-eval score