
pfizer-opensource/llm-uncertainty


Python 3.12 uv-managed Linted with Ruff License: MIT

Characterizing LLM Performance via a Bayesian Lens

Abstract

Robust evaluation of large language models (LLMs) is critical, yet standard benchmarks that rely on point estimates often fall short by overlooking the inherent stochasticity of responses and the heterogeneity across questions or tasks. Here we introduce a hierarchical Bayesian Beta-Binomial framework for comprehensive LLM benchmarking on multiple-choice question datasets. Our approach models the number of correct responses as a binomial distribution and decomposes the output variation into intra-question stochasticity (response variability for a given question) and inter-question heterogeneity (variation in difficulty across questions) by modeling them with separate priors, providing a more holistic probabilistic understanding of performance. The framework yields a probabilistic assessment, providing full posterior distributions and credible intervals for key dimensions: mean accuracy, inter-question heterogeneity, and mean intra-question response variability, thereby enabling more rigorous uncertainty quantification. We demonstrate its utility by evaluating multiple LLMs across diverse benchmarks, including under semantic perturbations such as question rephrasing that simulate real-world query variations. This analysis reveals nuanced insights into model robustness and uncovers distinct behaviors across model classes (e.g., reasoning vs. non-reasoning) that cannot be discerned with a traditional frequentist uncertainty approach. The methodology offers a statistically grounded and powerful lens for analyzing LLM capabilities, providing deeper insights into the accuracy, consistency, and heterogeneity essential for reliable model development, deployment, and comparison.
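
To make the model structure concrete, the sketch below shows one common way to express a hierarchical Beta-Binomial model of this kind in PyMC. It is an illustrative example, not the repository's actual implementation: the library choice, variable names (`mu`, `kappa`, `theta`), priors, and the simulated data are all assumptions for demonstration only. The question-level `theta` captures inter-question heterogeneity, while the Binomial likelihood over repeated samples of each question captures intra-question stochasticity.

```python
import numpy as np
import pymc as pm
import arviz as az

# Hypothetical data: for each of Q questions, the number of correct answers
# observed across R repeated samples of the same question (placeholder values).
rng = np.random.default_rng(0)
Q, R = 50, 20
correct_counts = rng.binomial(R, rng.beta(8, 3, size=Q))

with pm.Model() as beta_binomial:
    # Population-level mean accuracy and a concentration parameter;
    # lower kappa corresponds to greater inter-question heterogeneity.
    mu = pm.Beta("mu", alpha=1.0, beta=1.0)
    kappa = pm.HalfNormal("kappa", sigma=50.0)

    # Per-question probability of a correct answer, drawn from a Beta prior
    # parameterized by the population mean and concentration.
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1.0 - mu) * kappa, shape=Q)

    # Intra-question stochasticity: repeated responses to a given question
    # are modeled as Binomial draws around that question's theta.
    pm.Binomial("y", n=R, p=theta, observed=correct_counts)

    idata = pm.sample(1000, tune=1000, chains=4, random_seed=0)

# Posterior summaries and credible intervals for mean accuracy and heterogeneity.
print(az.summary(idata, var_names=["mu", "kappa"], hdi_prob=0.94))
```

The full posterior over `mu`, `kappa`, and the per-question `theta` values is what yields the credible intervals for mean accuracy, inter-question heterogeneity, and intra-question response variability described in the abstract.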
