
pfizer-opensource/llm-uncertainty


Python 3.12 uv-managed Linted with Ruff License: MIT

Characterizing LLM Performance via a Bayesian Lens

Abstract

Robust evaluation of large language models (LLMs) is critical, yet standard benchmarks that rely on point estimates often fall short by overlooking the inherent stochasticity of responses and the heterogeneity across questions or tasks. Here we introduce a hierarchical Bayesian Beta-Binomial framework for comprehensive LLM benchmarking on multiple-choice question datasets. Our approach models the number of correct responses as a binomial distribution and decomposes the output variation into intra-question stochasticity (response variability for a given question) and inter-question heterogeneity (variation in difficulty across questions) by modeling them with separate priors, providing a more holistic probabilistic understanding of performance. The framework yields a probabilistic assessment, providing full posterior distributions and credible intervals for key dimensions: mean accuracy, inter-question heterogeneity, and mean intra-question response variability, thereby enabling more rigorous uncertainty quantification. We demonstrate its utility by evaluating multiple LLMs across diverse benchmarks, including under semantic perturbations such as question rephrasing that simulate real-world query variations. This analysis reveals nuanced insights into model robustness and uncovers distinct behaviors across model classes (e.g., reasoning vs. non-reasoning) that cannot be discerned with a traditional frequentist uncertainty approach. The methodology offers a statistically grounded and powerful lens for analyzing LLM capabilities, providing deeper insights into the accuracy, consistency, and heterogeneity essential for reliable model development, deployment, and comparison.
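
To make the model structure concrete, the sketch below shows one common way to express a hierarchical Beta-Binomial model of this kind in PyMC. It is an illustrative example, not the repository's actual implementation: the library choice, variable names (`mu`, `kappa`, `theta`), priors, and the simulated data are all assumptions for demonstration only. The question-level `theta` captures inter-question heterogeneity, while the Binomial likelihood over repeated samples of each question captures intra-question stochasticity.

```python
import numpy as np
import pymc as pm
import arviz as az

# Hypothetical data: for each of Q questions, the number of correct answers
# observed across R repeated samples of the same question (placeholder values).
rng = np.random.default_rng(0)
Q, R = 50, 20
correct_counts = rng.binomial(R, rng.beta(8, 3, size=Q))

with pm.Model() as beta_binomial:
    # Population-level mean accuracy and a concentration parameter;
    # lower kappa corresponds to greater inter-question heterogeneity.
    mu = pm.Beta("mu", alpha=1.0, beta=1.0)
    kappa = pm.HalfNormal("kappa", sigma=50.0)

    # Per-question probability of a correct answer, drawn from a Beta prior
    # parameterized by the population mean and concentration.
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1.0 - mu) * kappa, shape=Q)

    # Intra-question stochasticity: repeated responses to a given question
    # are modeled as Binomial draws around that question's theta.
    pm.Binomial("y", n=R, p=theta, observed=correct_counts)

    idata = pm.sample(1000, tune=1000, chains=4, random_seed=0)

# Posterior summaries and credible intervals for mean accuracy and heterogeneity.
print(az.summary(idata, var_names=["mu", "kappa"], hdi_prob=0.94))
```

The full posterior over `mu`, `kappa`, and the per-question `theta` values is what yields the credible intervals for mean accuracy, inter-question heterogeneity, and intra-question response variability described in the abstract.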
