University of Michigan — EECS 498: Machine Learning Research Experience
Group 13 Capstone Project

This repository contains our work for the ML Research Experience capstone. The project consists of two phases:
- Replication: A reproduction of the findings from Improving Factuality and Reasoning in Language Models through Multiagent Debate (Du et al., 2023).
- Extension: A novel method proposing DOWN-HMAD (Debate Only When Necessary - Heterogeneous Multi-Agent Debate) to optimize the trade-off between reasoning accuracy and computational cost.
Large Language Models (LLMs) frequently suffer from hallucinations and "post-hoc rationalization" during complex reasoning tasks. While Multi-Agent Debate (MAD) has been shown to improve factuality, it introduces massive computational overhead (~15x FLOPs) and suffers from "echo chamber" effects when using homogeneous agents.
This project implements DOWN-HMAD, a novel framework that improves efficiency and reasoning diversity by:
- Gating: Utilizing a "gatekeeper" model to assess token confidence, triggering expensive debate only when necessary.
- Heterogeneity: Replacing identical agents with a diverse panel of specialized models (Generalist, Math Specialist, Reasoning Specialist) to mitigate correlated errors.
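In outline, the gating step reduces to a single routing decision. The sketch below is illustrative (the function and class names are ours, not the repo's); only the 0.8 threshold and the cheap-path/debate-path split come from the description above:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    answer: str
    confidence: float  # gatekeeper's confidence in its draft, in [0, 1]

def route(query: str,
          gatekeeper: Callable[[str], GateResult],
          debate: Callable[[str, str], str],
          threshold: float = 0.8) -> str:
    """Return the gatekeeper's draft if it is confident enough;
    otherwise escalate to the (expensive) debate panel."""
    gate = gatekeeper(query)
    if gate.confidence > threshold:
        return gate.answer             # cheap path: single-agent answer
    return debate(query, gate.answer)  # hard query: trigger DOWN-HMAD

# Toy stand-ins for the real models:
confident = lambda q: GateResult("42", 0.95)
unsure = lambda q: GateResult("41?", 0.40)
panel = lambda q, seed: f"debated({seed})"

print(route("easy question", confident, panel))  # -> 42
print(route("hard question", unsure, panel))     # -> debated(41?)
```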
Key Results:
- 40% reduction in total compute (FLOPs) on knowledge retrieval tasks.
- 18% accuracy improvement on Biography generation compared to single-agent baselines.
- Discovery of "Syntactic Determinism": Identified a critical failure mode in GSM8K where models report high confidence on incorrect answers due to correct formatting (e.g., `####` tokens), fooling the gating mechanism.
The system utilizes a confidence-driven pipeline. If the Gatekeeper's confidence (c1) in its initial answer exceeds 0.8, that answer is returned directly; otherwise the query is escalated to the heterogeneous debate panel:
```mermaid
graph TD
Q[Input Query] --> Gate["Gatekeeper Agent<br>(Llama-3.1-8B)"]
Gate --> Gen{Initial Generation}
Gen --> |Compute Confidence| C[Confidence Score c1]
C -- "c1 > 0.8" --> Output[Return Initial Answer]
C -- "c1 <= 0.8" --> Debate[Trigger DOWN-HMAD]
subgraph "Heterogeneous Panel"
A1["Llama-3.1-8B<br>(Generalist)"]
A2["DeepSeek-R1-Qwen<br>(Reasoning)"]
A3["Mathstral-7B<br>(Math Expert)"]
end
Debate --> A1 & A2 & A3
A1 & A2 & A3 --> Round1[Round 1: Cross-Seeding]
Round1 --> Round2[Round 2: Critique & Refine]
Round2 --> Vote[Majority Voting]
Vote --> Output
```
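The debate stage itself (cross-seeding, critique and refine, majority vote) can be sketched as follows; the two-round structure and voting follow the diagram above, but the agent interface and all names are illustrative:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across the panel (ties broken
    by first occurrence, since Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]

def debate(query, agents, rounds=2):
    """Cross-seeding debate: each agent first answers independently,
    then refines its answer after seeing the other agents' answers."""
    # Round 1: independent answers (cross-seeding input for round 2)
    answers = [agent(query, context=[]) for agent in agents]
    # Later rounds: critique & refine with the other agents' answers
    for _ in range(rounds - 1):
        answers = [
            agent(query, context=[a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    return majority_vote(answers)
```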
The system is built on a strict Factory Pattern to allow seamless swapping between API-based and Local LLMs.
- `llm/core/`: Defines the abstract `LLM` base class and `LLMConfig` dataclasses.
- `llm/implementations/`:
  - `local_llm.py`: Handles HuggingFace/vLLM loading, quantization, and local inference.
  - `api_llm.py`: Interface for OpenAI/Anthropic APIs.
- `llm/factory.py`: Instantiates the correct model class based on JSON configuration files.
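A minimal sketch of this factory arrangement, assuming the `LLM` and `LLMConfig` names mentioned above (everything else here is illustrative, not the repo's actual code):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class LLMConfig:
    name: str
    backend: str  # "local" (HF/vLLM) or "api" (OpenAI/Anthropic)

class LLM(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class LocalLLM(LLM):
    def __init__(self, config: LLMConfig):
        self.config = config
    def generate(self, prompt: str) -> str:
        return f"[local:{self.config.name}] response"  # stand-in for HF/vLLM inference

class APILLM(LLM):
    def __init__(self, config: LLMConfig):
        self.config = config
    def generate(self, prompt: str) -> str:
        return f"[api:{self.config.name}] response"  # stand-in for an API call

def create_llm(config: LLMConfig) -> LLM:
    """Instantiate the right wrapper from a config (normally parsed from JSON)."""
    backends = {"local": LocalLLM, "api": APILLM}
    return backends[config.backend](config)
```

Because callers only see the abstract `LLM` interface, swapping a local Llama for an API model is a one-line config change rather than a code change.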
The repository is organized to separate core framework logic from specific experimental setups.
```text
llm_multi_agent/
├── llm/                              # CORE FRAMEWORK
│   ├── core/                         # Abstract Base Classes & Configs
│   ├── implementations/              # Local (HF) & API (OpenAI) Wrappers
│   ├── configs/                      # Model parameters (DeepSeek, Llama, etc.)
│   └── factory.py                    # Model Instantiation Factory
│
├── extension/                        # DOWN-HMAD EXPERIMENTS (The Extension)
│   ├── biography/                    # Bio Generation with Confidence Gating
│   ├── gsm/                          # Math Reasoning with Confidence Gating
│   ├── mmlu/                         # MMLU Benchmark with Confidence Gating
│   └── run_all_extensions.sh         # HPC Job Submission Script
│
├── replication/                      # ORIGINAL PAPER REPLICATION (Baselines)
│   ├── gpt-3.5-turbo-replication/    # API-based Replication
│   └── open-source-replication/      # Llama-3.1-8B Replication
│
└── requirements.txt                  # Project Dependencies
```
- Python 3.10+
- CUDA-enabled GPU (required for `LocalLLM` / open-source models)
- High RAM (>32GB recommended for loading 8B+ models)
```bash
# 1. Clone the repository
git clone https://github.com/Wenjia-Lu/llm_multi_agent.git
cd llm_multi_agent

# 2. Create Virtual Environment
python -m venv venv
source venv/bin/activate

# 3. Install Dependencies
pip install -r requirements.txt
```

The extension uses the heterogeneous panel and confidence gating. You must specify the `confidence_threshold` (default 0.8).
Biography Generation:

```bash
cd extension/biography

# Generate responses using DOWN-HMAD
# --agents 3 indicates usage of the heterogeneous panel
python gen_OS.py --rounds 3 --confidence_threshold 0.8 --agents 3

# Evaluate accuracy and hallucinations
python eval_OS.py --agents 3 --rounds 3 <generated_output_file.json>
```

Grade School Math (GSM8K):
```bash
cd extension/gsm
python gen_OS.py --rounds 3 --confidence_threshold 0.8
```

To replicate the original Du et al. paper results without confidence gating or heterogeneity:
Open Source Baseline (Llama-3.1):

```bash
cd replication/open-source-replication/mmlu
python gen_OS.py
```

GPT-3.5 Baseline (API):

```bash
cd replication/gpt-3.5-turbo-replication/math
# Requires OPENAI_API_KEY env var
python gen_math.py
```

| Task | Method | Accuracy | FLOPs (Compute Cost) |
|---|---|---|---|
| Biography | Single Agent | 55% | 1.30e14 |
| | DOWN-HMAD | 73% | 4.49e14 |
| | Standard MAD | 54% | 2.01e15 |
| GSM8K | Single Agent | 76% | 1.44e14 |
| | DOWN-HMAD | 30%* | 2.74e14 |
- Efficiency: DOWN-HMAD achieved a 12x compute reduction on "easy" queries by successfully routing them to the single agent, while correctly elevating "hard" queries to the debate panel.
- The "Syntactic Determinism" Failure: In GSM8K, the system failed to trigger debate (leading to 30% accuracy). Analysis revealed that models assigned 100% confidence to tokens like `$` and `=` even when the arithmetic was wrong. The gatekeeper was fooled by correct syntax, creating a false negative for the debate trigger.
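One way this failure arises: if sequence confidence is aggregated over all tokens, near-certain formatting tokens dominate the score. A toy illustration, assuming a geometric-mean aggregation over token probabilities (the repo's actual scoring rule may differ):

```python
import math

def mean_confidence(token_probs):
    """Geometric-mean token probability as a sequence confidence score."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

# Toy GSM8K-style answer "#### 18": the formatting tokens are near-certain,
# the first numeric token is not -- yet the aggregate still clears a 0.8 gate.
probs = {"####": 0.999, " ": 0.999, "1": 0.55, "8": 0.85}
score = mean_confidence(list(probs.values()))
print(round(score, 3))  # -> 0.826, above the 0.8 threshold despite "1" at 0.55
```

Under this kind of aggregation, confidently emitted syntax masks uncertainty in the digits that actually carry the answer, which is consistent with the false-negative behavior observed above.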
Group 13 - University of Michigan
- Ethan Justice - Extension implementation, generalized LLM wrapper.
- Satyak Khare - vLLM refactoring, HPC deployment, Replication codebase.
- Wenjia Lu - Replication codebase, benchmark evaluations.
- Daniel Vega - Data visualization, Methodology, plotting.
- Eli Wiegman - Original paper selection, visualization, setup.
- Christopher Zhou - Project management, roadmap strategy.
- Du, Y. et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate.
- Eo, S. et al. (2025). Debate Only When Necessary (DOWN).
- Ye, R. et al. (2025). X-MAS: Heterogeneous LLM-driven MAS.