OpenCompass


41 repositories

RePro
Public
[Preprint 2025] Rectifying LLM Thought From Lens of Optimization
reinforcement-learning large-language-model large-language-model-reasoning
Python • MIT License • 3 • 8 • 0 • 0 • Updated Dec 5, 2025
opencompass
Public
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
benchmark evaluation openai llm chatgpt large-language-model llama2 llama3
Python • Apache License 2.0 • 698 • 6.4k • 359 • 64 • Updated Dec 5, 2025
VLMEvalKit
Public
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip
Python • Apache License 2.0 • 573 • 3.5k • 183 • 25 • Updated Dec 5, 2025
SAGA
Public
The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
0 • 10 • 0 • 0 • Updated Nov 27, 2025
ATLAS
Public
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
0 • 5 • 0 • 0 • Updated Nov 20, 2025
OASIS
Public
Python • 0 • 2 • 0 • 0 • Updated Nov 12, 2025
InteractScience
Public
JavaScript • Apache License 2.0 • 0 • 7 • 0 • 0 • Updated Oct 31, 2025
CognitiveKernel-Pro
Public
Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
Python • Other • 45 • 0 • 0 • 0 • Updated Oct 27, 2025
GAOKAO-Eval
Public
Jupyter Notebook • 7 • 111 • 5 • 0 • Updated Oct 7, 2025
.github
Public
1 • 0 • 0 • 0 • Updated Sep 9, 2025
MMBench-GUI
Public
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.
benchmark-framework vision-language-model computer-use gui-agent
Python • 3 • 86 • 5 • 0 • Updated Sep 8, 2025
ReasonZoo
Public
Python • Apache License 2.0 • 0 • 3 • 0 • 0 • Updated Aug 27, 2025
CompassVerifier
Public
[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Jupyter Notebook • 2 • 57 • 0 • 0 • Updated Aug 10, 2025
GPassK
Public
[ACL 2025] Are Your LLMs Capable of Stable Reasoning?
large-language-model-evaluation reasoning-stability
Python • 2 • 31 • 2 • 0 • Updated Aug 5, 2025
Creation-MMBench
Public
Assessing Context-Aware Creative Intelligence in MLLMs
JavaScript • 0 • 23 • 1 • 0 • Updated Jul 22, 2025
CompassJudger
Public
The all-in-one judge models introduced by OpenCompass
Apache License 2.0 • 5 • 114 • 1 • 0 • Updated Jul 15, 2025
RaML
Public
[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Jupyter Notebook • 2 • 6 • 0 • 0 • Updated May 27, 2025
BotChat
Public
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook • Apache License 2.0 • 6 • 159 • 2 • 0 • Updated May 22, 2025
Ada-LEval
Public
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
gpt4 llm long-context
Python • 3 • 55 • 0 • 0 • Updated May 22, 2025
MathBench
Public
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
Apache License 2.0 • 1 • 109 • 5 • 0 • Updated May 22, 2025
MMBench
Public
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
Apache License 2.0 • 15 • 273 • 12 • 0 • Updated May 22, 2025
ProSA
Public
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Python • Apache License 2.0 • 2 • 29 • 0 • 0 • Updated May 22, 2025
ANAH
Public
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
acl alignment gpt iclr neurips llms hallucination-detection hallucination-mitigation
Python • Apache License 2.0 • 4 • 59 • 1 • 0 • Updated Apr 30, 2025
GTA
Public
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
llm-agent llm-evaluation
Python • Apache License 2.0 • 7 • 130 • 1 • 0 • Updated Mar 28, 2025
oc_doc_website
Public
0 • 0 • 0 • 0 • Updated Feb 12, 2025
CriticEval
Public
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
Python • Apache License 2.0 • 2 • 48 • 0 • 0 • Updated Nov 29, 2024
lagent-cibench
Public
Python • Apache License 2.0 • 1 • 2 • 0 • 0 • Updated Sep 23, 2024
hinode
Public
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
HTML • MIT License • 63 • 0 • 0 • 0 • Updated Sep 1, 2024
storage
Public
Apache License 2.0 • 0 • 0 • 0 • 0 • Updated Aug 18, 2024
CompassBench
Public
Demo data of CompassBench
3 • 11 • 3 • 0 • Updated Aug 7, 2024