See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch.
ScratchWorld is a comprehensive benchmark for evaluating GUI agents in Scratch, with a focus on precise screen-grounded interaction (especially accurate drag-and-drop) rather than text-only code generation.
- Primitive Mode: Agents use low-level GUI actions (click, drag-and-drop, type, key operations) for fine-grained control.
- Composite Mode: Agents use higher-level semantic APIs that package multiple UI actions into block-editing operations (a sketch contrasting the two modes follows this list).
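As a rough illustration, the two modes differ mainly in the granularity of a single action. The snippet below is a minimal Python sketch of what one block-placement step could look like in each mode; the action names, fields, and coordinates are hypothetical and do not reflect the benchmark's actual action schema.

```python
# Hypothetical sketch of one agent step in each mode.
# Field names and values are illustrative only, not the benchmark's real schema.

# Primitive mode: the agent emits low-level GUI events with screen coordinates,
# so placing a block is an explicit drag from the palette into the workspace.
primitive_step = {
    "action": "drag_and_drop",
    "start": {"x": 132, "y": 418},   # where the block sits in the palette
    "end": {"x": 742, "y": 305},     # target position in the scripts area
}

# Composite mode: the agent calls a semantic block-editing operation and the
# environment expands it into the underlying clicks and drags.
composite_step = {
    "action": "add_block",
    "opcode": "motion_movesteps",     # Scratch opcode for "move 10 steps"
    "parent": "when_flag_clicked_1",  # attach under an existing hat block (hypothetical id)
    "inputs": {"STEPS": 10},
}
```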
The benchmark evaluates performance across four distinct software engineering capabilities:
- 🎨 Create: Synthesizing functional projects from scratch based on natural language descriptions.
- 🐛 Debug: Diagnosing and repairing logical errors in broken scripts to restore expected behavior.
- 🚀 Extend: Implementing new feature modules into existing codebases without disrupting original logic.
- 🧮 Compute: Utilizing Scratch blocks to solve abstract algorithmic problems and mathematical reasoning tasks.
- 2026-02-12: We release our paper, leaderboard, and code.
Project structure:
scratch-bench/
├── Agent-S/ # Agent-S code and dependencies
├── ocr_server/ # OCR service for primitive mode
├── scratch-gui/ # Scratch 3.0 frontend interface
├── scratch-bench-api/ # Backend API server
├── scratchbench/ # Core evaluation framework
├── tasks/ # Task definitions
│ ├── build/ # Build tasks
│ ├── fix/ # Fix tasks
│ ├── mbpp/ # Algorithm tasks
│ └── modify/ # Modification tasks
├── single_step_drag_benchmark/ # Single-step drag benchmark scripts and assets
├── visual_perception_benchmark/ # Visual perception benchmark scripts and assets
├── task_runner.py # Main entry point for running tasks
├── run_single_step_drag_benchmark.py # Run single-step drag benchmark
├── run_visual_perception_benchmark.py # Run visual perception benchmark
├── results.py # Analyze main benchmark results
├── results_visual_perception_benchmark.py # Analyze visual perception benchmark results
├── results_single_step_drag_benchmark.py # Analyze single-step drag benchmark results
├── curate_submit.py # Curate the JSONL for leaderboard submission
└── agent-config.json.example # Configuration needed for each agent
Environment setup has two parts: Node.js for scratch-gui, and Python for scratch-bench-api, the benchmark runners, and the optional OCR/Agent integrations.
Required (for reproducibility): Node.js v25.5.0
Option A (recommended): install via nvm
# install nvm first (see https://github.com/nvm-sh/nvm)
nvm install 25.5.0
nvm use 25.5.0
node -v # should print v25.5.0
Option B (no nvm): install Node.js v25.5.0 directly
# download the v25.5.0 installer from https://nodejs.org/ and install
node -v # should print v25.5.0
Install and run scratch-gui:
cd scratch-gui
npm install
npm start
Recommended: create a conda env with Python 3.10
conda create -n scratch-bench python=3.10
conda activate scratch-bench
Install base Python dependencies:
python -m pip install -r requirements.txt
Install scratch-bench-api (editable for local development):
python -m pip install -e scratch-bench-api
Install Playwright browser (required by the API):
python -m playwright install chromium
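Optional sanity check: you can confirm that the Chromium build installed by Playwright launches and can reach a running scratch-gui. The script below is only a local smoke test, under the assumption that scratch-gui is already serving the default URL; it is not part of the benchmark itself.

```python
# Optional smoke test (not part of the benchmark): verify that Playwright's
# Chromium build works and can reach a locally running scratch-gui.
# Assumes scratch-gui is already serving the default URL from this README.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://localhost:8601?locale=en", wait_until="load")
    print("Loaded scratch-gui, page title:", page.title())
    browser.close()
```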
Install dependencies needed for Agent-S:
cd Agent-S
python -m pip install -e .
Install PaddleOCR for running primitive mode (OCR server).
Follow PaddleOCR install docs: https://www.paddleocr.ai/main/version3.x/installation.html to install paddlepaddle and paddleocr.
# example (GPU build, with cu126)
python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr==3.3.0
Start scratch-gui (terminal 1)
cd scratch-gui
npm start # on port 8601 by default
Start scratch-bench-api (terminal 2)
python -m api.main # on port 8081 by default
Start ocr-server (optional; terminal 3)
cd ocr_server
python main.py --gpu # on port 9090 by default; omit --gpu for CPU
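With the services up, a quick way to confirm the default ports are reachable is a small socket check like the one below. This is a convenience sketch rather than a repository script; the port numbers are the defaults quoted in the commands above.

```python
# Convenience sketch: check that the default service ports quoted above are reachable.
# Not a repository script; adjust the ports if you changed the defaults.
import socket

SERVICES = {
    "scratch-gui": 8601,
    "scratch-bench-api": 8081,
    "ocr_server (optional, primitive mode)": 9090,
}

for name, port in SERVICES.items():
    with socket.socket() as sock:
        sock.settimeout(1.0)
        status = "up" if sock.connect_ex(("localhost", port)) == 0 else "down"
    print(f"{name:<40} port {port}: {status}")
```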
Setup .env
OPENAI_API_KEY="your-api-key-here"
LLM_BASE_URL="your-base-url-here"
# Optional: Scratch GUI URL for the API to connect to (defaults to http://localhost:8601?locale=en)
SCRATCH_GUI_URL="http://localhost:8601?locale=en"
# Optional: cap on concurrent sessions (also the task_runner parallelism cap); do not set --parallel above this.
MAX_SESSIONS=100
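task_runner.py consumes this file through the --env-file flag, so you normally do not load it yourself. If you want to inspect the values from Python, a minimal sketch using python-dotenv (assuming it is installed) could look like this; the SCRATCH_GUI_URL fallback matches the default noted above, while the MAX_SESSIONS fallback is purely illustrative.

```python
# Minimal sketch for inspecting the .env values above (assumes python-dotenv is
# installed; the runner itself reads the file via --env-file).
import os
from dotenv import load_dotenv

load_dotenv(".env")

api_key = os.environ["OPENAI_API_KEY"]
base_url = os.environ["LLM_BASE_URL"]
# SCRATCH_GUI_URL default is documented above; the MAX_SESSIONS fallback here is illustrative.
gui_url = os.environ.get("SCRATCH_GUI_URL", "http://localhost:8601?locale=en")
max_sessions = int(os.environ.get("MAX_SESSIONS", "1"))

print(f"GUI at {gui_url}; up to {max_sessions} concurrent session(s)")
```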
Run a single task in primitive mode
python task_runner.py --model gpt-5 --mode primitive --task_list 1_task.json --max_steps 50 --parallel 1 --use_last_screenshot --agent scratch-agent --env-file .env
Preparation:
- Create the env file referenced by the command you choose (e.g., .env.dp, .env.gemini). At minimum, set OPENAI_API_KEY and LLM_BASE_URL.
- If you run Agent-S2 or AWM, copy agent-config.json.example to agent-config.json and fill in the <YOUR_*_API_KEY> fields.
Main benchmark (all tasks):
# scratch-agent (composite)
python task_runner.py --model gpt-5 --mode composite --task_list all_tasks.json --max_steps 50 --parallel 1 --no_recording --agent scratch-agent --env-file .env
# scratch-agent (primitive)
python task_runner.py --model gpt-5 --mode primitive --task_list all_tasks.json --max_steps 50 --parallel 1 --use_last_screenshot --agent scratch-agent --env-file .env
Agent-S2 (primitive):
python task_runner.py --model gemini-2.5-pro --mode primitive --task_list all_tasks.json --max_steps 50 --parallel 1 --use_last_screenshot --agent agent-s2 --tasks_dir tasks --env-file .env
AWM:
# composite
python task_runner.py --model gemini-2.5-pro --mode composite --task_list all_tasks.json --max_steps 50 --parallel 1 --no_recording --agent awm --tasks_dir tasks --env-file .env
# primitive
python task_runner.py --model gemini-2.5-pro --mode primitive --task_list all_tasks.json --max_steps 50 --parallel 1 --use_last_screenshot --agent awm --tasks_dir tasks --env-file .env
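To sweep several agent/mode combinations in one go, a thin wrapper that shells out to task_runner.py can save some typing. The sketch below simply replays the commands above via subprocess; the chosen models, agents, and flags mirror those commands and are meant to be edited to your setup.

```python
# Convenience wrapper that replays the task_runner.py commands above for several
# agent/mode combinations. Edit models, agents, env files, and flags to your setup.
import subprocess

RUNS = [
    # (agent, mode, model, extra flags) -- mirrors the commands in this README
    ("scratch-agent", "composite", "gpt-5", ["--no_recording"]),
    ("scratch-agent", "primitive", "gpt-5", ["--use_last_screenshot"]),
    ("awm", "composite", "gemini-2.5-pro", ["--no_recording", "--tasks_dir", "tasks"]),
    ("awm", "primitive", "gemini-2.5-pro", ["--use_last_screenshot", "--tasks_dir", "tasks"]),
]

for agent, mode, model, extra in RUNS:
    cmd = [
        "python", "task_runner.py",
        "--model", model,
        "--mode", mode,
        "--task_list", "all_tasks.json",
        "--max_steps", "50",
        "--parallel", "1",
        "--agent", agent,
        "--env-file", ".env",
        *extra,
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```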
Single-step drag benchmark (single_step_drag_benchmark):
# baseline
python run_single_step_drag_benchmark.py --model gpt-5 --times 3
# knowledge
python run_single_step_drag_benchmark.py --model gpt-5 --knowledge --times 3
# GT-start
python run_single_step_drag_benchmark.py --model gpt-5 --ground-truth-start --times 3
Visual perception benchmark (visual_perception_benchmark):
# full run
python run_visual_perception_benchmark.py --model gpt-5
# small smoke test (one task per tag)
python run_visual_perception_benchmark.py --model gpt-5 --small