    Repositories list

    • RePro

      Public
      [Preprint 2025] Rectifying LLM Thought From Lens of Optimization
      Python
      3800 Updated Dec 5, 2025
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
      Python
      6986.4k35964 Updated Dec 5, 2025
    • VLMEvalKit

      Public
      Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      5733.5k18325 Updated Dec 5, 2025
    • SAGA

      Public
      The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
      01000 Updated Nov 27, 2025
    • ATLAS

      Public
      ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
      0500 Updated Nov 20, 2025
    • OASIS

      Public
      Python
      0200 Updated Nov 12, 2025
    • JavaScript
      0700 Updated Oct 31, 2025
    • Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
      Python
      45000 Updated Oct 27, 2025
    • Jupyter Notebook
      711150 Updated Oct 7, 2025
    • .github

      Public
      1000 Updated Sep 9, 2025
    • Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.
      Python
      38650 Updated Sep 8, 2025
    • ReasonZoo

      Public
      Python
      0300 Updated Aug 27, 2025
    • [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
      Jupyter Notebook
      25700 Updated Aug 10, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      23120 Updated Aug 5, 2025
    • Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02310 Updated Jul 22, 2025
    • The all-in-one judge models introduced by OpenCompass
      511410 Updated Jul 15, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2600 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      615920 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35500 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      110950 Updated May 22, 2025
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      15273120 Updated May 22, 2025
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      22900 Updated May 22, 2025
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
      Python
      45910 Updated Apr 30, 2025
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      713010 Updated Mar 28, 2025
    • 0000 Updated Feb 12, 2025
    • [NeurIPS 2024] A comprehensive benchmark for evaluating the critique ability of LLMs
      Python
      24800 Updated Nov 29, 2024
    • Python
      1200 Updated Sep 23, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      63000 Updated Sep 1, 2024
    • storage

      Public
      0000 Updated Aug 18, 2024
    • Demo data of CompassBench
      31130 Updated Aug 7, 2024