Commit a7428ef
Merge pull request #334 from linhaowei1/main: Add SLDBench
2 parents 9290243 + b82ac2c
13 files changed: +1538 -0 lines

examples/sldbench/README.md

Lines changed: 162 additions & 0 deletions
# SLDBench — Scaling Law Discovery Benchmark

## Introduction

**SLDBench** is a benchmark for discovering scaling laws, originally introduced in the paper [*Can Language Models Discover Their Own Scaling Laws?*](https://arxiv.org/abs/2507.21184) by Lin et al. It aggregates over **5,000 LLM training experiments** from recent scaling-law literature into a unified dataset, hosted on the Hugging Face Hub at [`pkuHaowei/sldbench`](https://huggingface.co/datasets/pkuHaowei/sldbench).

See also this [blog post](https://algorithmicsuperintelligence.ai/blog/openevolve-sldagent/) for a quick introduction to OpenEvolve x SLDBench.
## Overview

SLDBench focuses on **discovery** rather than simple curve fitting. The agent must identify:

- A **symbolic law** $f_\theta(x)$ (the functional form).
- A **parameter fitting routine** that generalizes across multiple training scenarios.

**Key Features:**

- **Data Source:** All task data is pulled dynamically from the Hugging Face dataset (see the loading sketch below).
- **Extrapolation Evaluation:** Models are trained on smaller-scale runs and strictly evaluated on held-out, larger-scale configurations to test predictive capability.
- **Evolutionary Loop:** OpenEvolve iteratively mutates and evaluates candidate implementations of `scaling_law_func(...)` (the symbolic law) and `fit_scaling_law(...)` (the optimizer).
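
As a minimal sketch of what pulling one task's data might look like (the config name and split layout are assumptions; `data_loader.py` is the loader actually used by this example):

```python
from datasets import load_dataset

# Hypothetical illustration: the config name is assumed to mirror the task
# names in the table below; consult data_loader.py for the authoritative
# access pattern.
ds = load_dataset("pkuHaowei/sldbench", "sft_scaling_law")
print(ds)  # inspect the available splits and columns
```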
------
## SLDBench Tasks

There are currently 7 core scaling-law discovery tasks, each derived from real-world LLM experiments. Configuration files for these tasks are located in `examples/sldbench/configs/`.

| **Task Name (Config)** | **Scenario** | **Inputs (X)** | **Target (y)** |
| -------------------------------- | -------------------------------------------- | ---------------------------------------------------- | ------------------------------ |
| **parallel_scaling_law** | Parallel / Best-of-N inference scaling. | Model size $N$, Parallelism $P$ | Loss $L(N, P)$ |
| **vocab_scaling_law** | Vocabulary size vs. model/data scaling. | Non-vocab size $N$, Vocab size $V$, Dataset size $D$ | Unigram-normalized loss $L$ |
| **sft_scaling_law** | Supervised Fine-Tuning (SFT). | SFT dataset size $D$ | Fine-tuning loss $L(D)$ |
| **domain_mixture_scaling_law** | Multi-domain pre-training mixtures. | Domain mixture proportions $r$ | Per-domain losses $\{L_i(r)\}$ |
| **moe_scaling_law** | Mixture-of-Experts (MoE) scaling. | Network size $N$, Experts $E$ | Pre-training loss $L(N, E)$ |
| **data_constrained_scaling_law** | Data-constrained pre-training regimes. | Model size $N$, Dataset size $D$, Unique tokens $U$ | Loss $L(N, D, U)$ |
| **lr_bsz_scaling_law** | Joint Learning Rate / Batch Size (Step Law). | LR $l$, Batch size $b$, Dataset $D$, Model $N$ | Loss $L(l, b, D, N)$ & Optima |

> **Note:** A task named `easy_question_scaling_law` is also included for U-shape scaling studies, though it is not part of the current paper reference.

------
## File Structure

- `configs/` — YAML configuration files defining data splits, features, targets, and evaluation settings for each task.
- `data_loader.py` — Unified data loader for the `pkuHaowei/sldbench` Hugging Face dataset.
- `evaluator.py` — The evaluation framework; handles data splitting (train/extrapolate) and metric computation.
- `init_program.py` — The seed implementation (a power-law–style baseline) to jumpstart the evolutionary search.

------
## Usage

### Configuration Prerequisites

Before running any tasks, ensure the API key environment variable for your API provider is set (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

### Running Individual Tasks

To run the evolutionary process for a specific task:
```bash
python openevolve-run.py \
  examples/sldbench/init_program.py \
  examples/sldbench/evaluator.py \
  --config examples/sldbench/configs/sft_scaling_law.yaml \
  --api-base "https://api.openai.com/v1" \
  --iterations 50
```

To switch tasks, simply point the `--config` argument to a different YAML file in `examples/sldbench/configs/` (filenames follow the task names in the table above), for example:
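
```bash
# Hypothetical example: the MoE task; the filename is assumed to match its config name.
python openevolve-run.py \
  examples/sldbench/init_program.py \
  examples/sldbench/evaluator.py \
  --config examples/sldbench/configs/moe_scaling_law.yaml \
  --api-base "https://api.openai.com/v1" \
  --iterations 50
```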

### Automated Benchmark Script

A complete benchmark script, `run.sh`, is provided for running multiple tasks across different models in parallel:

```bash
cd examples/sldbench
chmod +x run.sh
./run.sh 4  # Run with a parallelism degree of 4 per model
```

**Note:** Configure the `API_BASE` variable in the script (it defaults to `https://api.openai.com/v1`) and ensure your API key environment variable is set.

The script handles evolution and evaluation automatically, storing results in `./results/`.

### Important: Test Set Evaluation

**Note:** The `openevolve-run` command only evaluates programs on the **training set** during evolution. To compute final metrics on the **test set**, you must explicitly run:

```bash
python evaluator.py "path/to/generated_program.py"
```

When executed directly (in `__main__` mode), `evaluator.py` computes metrics on the held-out extrapolation test set, which is the proper way to evaluate the predictive capability of the discovered scaling laws.

------

## Data Format & Evaluation

Each task is formulated as a scaling-law discovery problem containing:

1. **Features ($X$):** Input variables (e.g., $N, D, \text{LR}, \text{Batch Size}$).
2. **Targets ($y$):** Performance metrics (typically training or validation loss).
3. **Groups:** Control indices representing distinct experimental settings (e.g., different model architectures) that share the law *form* but require distinct fitted *parameters* (see the sketch below).
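
For intuition, here is a hedged sketch of the shapes involved; the variable names and values are illustrative, not the evaluator's actual data:

```python
import numpy as np

# Illustrative only: the real columns depend on the task config.
X = np.array([[1e8, 1e9],      # e.g., [model size N, dataset size D]
              [1e8, 2e9],
              [4e8, 1e9]])
y = np.array([3.2, 3.0, 2.9])   # one loss value per run
groups = np.array([0, 0, 1])    # rows 0-1 share fitted parameters; row 2 gets its own

# One functional form, one fitted parameter vector per group:
# params_by_group = {g: fit_scaling_law(X[groups == g], y[groups == g])
#                    for g in np.unique(groups)}
```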

### The Evaluation Process

1. **Splitting:** The evaluator partitions data into **training** and **extrapolation test** sets. The largest models or datasets are explicitly held out to mirror real-world forecasting needs.
2. **Fitting:** The `fit_scaling_law` function optimizes parameters on the training portion for each group.
3. **Scoring:** The fitted law is applied to the test set to compute the following metrics:

- **NMSE:** Normalized Mean Squared Error
- **NMAE:** Normalized Mean Absolute Error
- **$R^2$:** Coefficient of Determination
- **Combined Score:** A single scalar summary (currently equivalent to $R^2$).

*Higher combined scores indicate superior extrapolation quality.*
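
As a rough sketch of these definitions (the exact normalization used in `evaluator.py` may differ):

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    """Illustrative metric definitions; evaluator.py's normalization may differ."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    nmse = np.mean((y_true - y_pred) ** 2) / np.var(y_true)            # MSE / target variance
    nmae = np.mean(np.abs(y_true - y_pred)) / np.mean(np.abs(y_true))  # MAE / mean |target|
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                                         # coefficient of determination
    return {"nmse": nmse, "nmae": nmae, "r2": r2, "combined_score": r2}
```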
118+
119+
------
120+
121+

## Evolution Markers

OpenEvolve modifies code explicitly wrapped in evolution blocks. The agent evolves the symbolic form and the optimizer simultaneously:

```python
# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
    # Returns predicted values given inputs and parameters
    pass

def fit_scaling_law(data_points, loss_values):
    # Optimizes parameters to fit the scaling law
    pass
# EVOLVE-BLOCK-END
```

The system mutates these blocks, evaluates them via `evaluator.py`, and maintains a database of the highest-performing implementations.
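
For concreteness, a filled-in evolve block in the spirit of the power-law seed might look like the sketch below (an assumption-laden illustration, not the shipped `init_program.py`):

```python
# EVOLVE-BLOCK-START
import numpy as np
from scipy.optimize import minimize

def scaling_law_func(data_points, params):
    # Illustrative saturating power law L(x) = c + a * x^(-b) on the first feature.
    a, b, c = params
    x = np.asarray(data_points, dtype=float).reshape(len(data_points), -1)[:, 0]
    return c + a * np.power(x, -b)

def fit_scaling_law(data_points, loss_values):
    # Least-squares fit of (a, b, c) via BFGS from a fixed initial guess.
    def objective(p):
        pred = scaling_law_func(data_points, p)
        return np.mean((pred - np.asarray(loss_values)) ** 2)
    return minimize(objective, x0=np.array([1.0, 0.5, 1.0]), method="BFGS").x
# EVOLVE-BLOCK-END
```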
140+
141+

## Requirements

```bash
pip install datasets numpy scipy
# Ensure the latest version of openevolve is installed
```

## Citation

If you use SLDBench, this example, or derived results in your work, please cite the original paper:

```
@article{lin2025sldbench,
  title   = {Can Language Models Discover Scaling Laws?},
  author  = {Lin, Haowei and Ye, Haotian and Feng, Wenzheng and Huang, Quzhe and
             Li, Yujun and Lim, Hubert and Li, Zhengrui and Wang, Xiangyu and
             Ma, Jianzhu and Liang, Yitao and Zou, James},
  journal = {arXiv preprint arXiv:2507.21184},
  year    = {2025}
}
```
examples/sldbench/configs/data_constrained_scaling_law.yaml

Lines changed: 88 additions & 0 deletions
# Configuration for data constrained scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
  primary_model: null
  primary_model_weight: 1.0
  secondary_model: null
  secondary_model_weight: 0.0
  api_base: ""
  max_tokens: 16384
  timeout: 240
  retries: 10
  retry_delay: 10
# Prompt configuration
prompt:
  system_message: |
    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between training data characteristics and model loss under data-constrained conditions.

    **IMPORTANT: The scaling law function must use no more than 7 parameters.**

    Focus on mathematical accuracy across different data scales, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

    **DATA CHARACTERISTICS (182 total data points):**
    - Features: [unique_tokens, params, tokens] - 3D input
    - Labels: loss - scalar output
    - Dataset size: 161
    - Parameter range (P): 1.1e8 to 1.1e9 (100M to 1.1B parameters)
    - Token count range (D): 1e9 to 1e12 tokens
    - Unique tokens range: 1e7 to 5e8 unique tokens
    - Loss range: 1.8 to 7.2 (cross-entropy loss)
    - Model architectures: Transformer variants with different parameterizations
    - Data explores scaling under token/unique-token constraints

    The function signatures must remain:

    ```python
    def scaling_law_func(data_points, params):
        # data_points: (N, 3) array with columns [unique_tokens, params, tokens]
        #   unique_tokens: array of unique token counts
        #   params (column): array of model parameter counts
        #   tokens: array of total token counts
        # params (argument): array of up to 7 fitted parameters
        # Returns: predicted loss values

    def fit_scaling_law(data_points, loss_values):
        # data_points: (N, 3) array with columns [unique_tokens, params, tokens]
        # loss_values: array of corresponding loss values
        # Returns: optimized parameters (up to 7 parameters)
    ```

    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features (e.g., median, min, max) in scaling_law_func.

  num_top_programs: 3
  num_diverse_programs: 2
  use_template_stochasticity: true
# Database configuration for evolution
database:
  population_size: 100
  archive_size: 50
  num_islands: 5
  migration_interval: 25
  migration_rate: 0.1
  elite_selection_ratio: 0.1
  exploration_ratio: 0.2
  exploitation_ratio: 0.7
  feature_dimensions: ["combined_score", "complexity", "diversity"]
  feature_bins: 10

# Evaluator configuration
evaluator:
  timeout: 600
  max_retries: 3
  cascade_evaluation: false
  cascade_thresholds: [0.3, 0.6]
  parallel_evaluations: 4
  use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000
examples/sldbench/configs/domain_mixture_scaling_law.yaml

Lines changed: 85 additions & 0 deletions
# Configuration for domain mixture scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
  primary_model: null
  primary_model_weight: 1.0
  secondary_model: null
  secondary_model_weight: 0.0
  api_base: ""
  max_tokens: 16384
  timeout: 240
  retries: 10
  retry_delay: 10
# Prompt configuration
prompt:
  system_message: |
    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between domain mixture proportions and multi-domain loss values across different model sizes.

    **IMPORTANT: The scaling law function must use no more than 35 parameters.**

    Focus on mathematical accuracy across different model sizes, cross-domain generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

    **DATA CHARACTERISTICS**
    - Features: Domain proportions (5 domains) - array of shape (n_mixtures, 5)
    - Labels: Multi-domain losses (5 domains) - array of shape (n_mixtures, 5)
    - Dataset size: 80 training examples (20 per model size)
    - Model parameter sizes: 70M, 160M, 410M, 1B parameters (4 separate groups)
    - Domain proportions: Each row sums to 1.0 (mixture weights)
    - Loss ranges: Domain losses span 1.8-4.2 cross-entropy loss
    - Mixture configurations: Systematic exploration of different domain weight combinations
    - This is a multi-output regression problem with correlated domain performances

    The function signatures must remain:

    ```python
    def scaling_law_func(data_points, params):
        # data_points: (N, 5) array of domain mixture proportions (5 domains)
        # params: array of up to 35 fitted parameters
        # Returns: predicted multi-domain loss values, shape (N, 5)

    def fit_scaling_law(data_points, loss_values):
        # data_points: (N, 5) array of domain mixture proportions (5 domains)
        # loss_values: array of corresponding multi-domain losses, shape (N, 5)
        # Returns: optimized parameters (up to 35 parameters)
    ```

    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features (e.g., median, min, max) in scaling_law_func.

  num_top_programs: 3
  num_diverse_programs: 2
  use_template_stochasticity: true
# Database configuration for evolution
database:
  population_size: 100
  archive_size: 50
  num_islands: 5
  migration_interval: 25
  migration_rate: 0.1
  elite_selection_ratio: 0.1
  exploration_ratio: 0.2
  exploitation_ratio: 0.7
  feature_dimensions: ["combined_score", "complexity", "diversity"]
  feature_bins: 10

# Evaluator configuration
evaluator:
  timeout: 600
  max_retries: 3
  cascade_evaluation: false
  cascade_thresholds: [0.3, 0.6]
  parallel_evaluations: 4
  use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000
